Abstract: It is shown analytically how a neural network can be used optimally 
to encode input data that is derived from a toroidal manifold. The case of a 
2-layer network is considered, where the output is assumed to be a set of dis- 
crete neural firing events. The network objective function measures the average 
Euclidean error that occurs when the network attempts to reconstruct its input 
from its output. This optimisation problem is solved analytically for a toroidal 
input manifold, and two types of solution are obtained: a joint encoder in which 
the network acts as a soft vector quantiser, and a factorial encoder in which the 
network acts as a pair of soft vector quantisers (one for each of the circular 
subspaces of the torus). The factorial encoder is favoured for small network 
sizes when the number of observed firing events is large. Such self-organised 
factorial encoding may be used to restrict the size of network that is required 
to perform a given encoding task, and will decompose an input manifold into 
its constituent submanifolds. 

1 Introduction 

The purpose of this paper is to show analytically how a neural network can be 
used to optimally encode input data that is derived from a toroidal manifold. 
For simplicity, only the case of a 2-layer network is considered, and an objective
function is defined [?] that measures the average ability of the network to
reconstruct the state of its input layer from the state of its output layer. The 
optimum network parameter values must then minimise this objective function. 
In this paper the output state is chosen to be the vector of locations of a finite 
number of the neural firing events that arise when an input vector is presented 
to the network, and, in the limit of a single firing event, this reduces to a 
winner-take-all encoder network. 

If the input vector is obtained from an arbitrary input probability density 
function (PDF), then the network would have to be optimised numerically, and 
a simple interpretation of its optimal parameters would not then be guaranteed. 
On the other hand, if the input PDF is constrained to have a simple enough form, 
then an analytic optimisation guarantees that the results can be interpreted. 
Because the purpose of this paper is mainly to interpret the nature of the optimal 
solution(s) that arise from the interplay between the input PDF and the network 
objective function, an analytic rather than a numerical approach will be used. 

The detailed form of the optimum network parameters depends on the cho- 
sen input PDF, and, for simplicity, the input PDF will be chosen to define a 
curved manifold which is uniformly populated by all of the allowed input vec- 
tors. The shape of this manifold then determines the type of optimum solution 
that the network adopts. For instance, a 1-dimensional linear manifold with a 
uniform distribution of input vectors leads to an optimum solution in which each 
neuron fires only if the input lies within a small range of values, so the network 
behaves as a soft scalar quantiser. This result generalises to higher dimensional 
linear manifolds, where the network behaves as a soft vector quantiser. A more 
interesting type of optimum solution can occur when the manifold is curved. For 
instance, a circular manifold (which is a 1-dimensional manifold embedded in a 
2-dimensional space) leads to an optimum solution that is analogous to the soft 
scalar quantiser obtained with a 1-dimensional linear manifold, but a toroidal 
manifold (which is a 2-dimensional manifold embedded in a 4-dimensional space) 
does not necessarily lead to an optimum solution that is analogous to the soft 
vector quantiser obtained with a 2-dimensional linear manifold. 

For a 2-dimensional toroidal manifold, it is possible for the optimum solution 
to be constructed out of a pair of soft scalar quantisers, each of which encodes 
only one of the two circular manifolds that form the toroidal manifold. This is 
called a factorial encoder (because it breaks the input into its constituent factors, 
which it then encodes), as opposed to a joint encoder (which directly encodes the 
input, without first breaking it into its constituent factors). Because a factorial 
encoder splits up the overall encoding problem into a number of smaller encoding 
problems, which it then tackles in parallel, it requires fewer neurons than a joint 
encoder would have needed for the same encoding problem. 

For the type of network objective function that is discussed in this paper, 
factorial encoding does not occur with linear manifolds. This is because the 
random nature of the neural firing events does not guarantee that at least one 
such event occurs in each of the soft scalar quantisers in a factorial encoder, and, 
for a linear manifold, this leads to a much larger average reconstruction error 
if a factorial encoder is used than if a joint encoder is used. This effect is
summarised in figure 1 for a linear manifold, and in figure 2 for a toroidal
manifold. Henceforth, only the toroidal case will be discussed, because it is a
curved manifold which thus has interesting factorial encoding properties,
whereas a linear manifold would not.

In figure 2(a) the torus is overlaid with a 20 x 20 toroidal lattice, and a
typical joint encoding cell is highlighted (this would use a total of 400 = 20 x 20
neurons). Figure 2(a) makes clear why such encoding is described as "joint",
because the response of each neuron depends on the values of both dimensions
of the input. The neural network implementation of this type of joint encoder
would have connections from each output neuron to all of the input neurons.

Figure 1: Diagram (a) shows the encoding cells for joint encoding of a
2-dimensional linear manifold; a typical encoding cell is shaded. Diagram (b)
shows the corresponding encoding cells for a factorial encoder; typical encoding 
cells for each of the two factors and their intersection are shaded. The distortion 
that would result from only one of the two factors is large, because the encoding 
cell is a long thin rectangular region. 



In figure 2(b) the torus is overlaid with a 20 x 20 toroidal lattice, and a
typical pair of intersecting factorial encoding cells is highlighted (this would use
a total of 40 = 20 + 20 neurons). Figure 2(b) makes clear why such encoding
is described as "factorial", because the response of each neuron depends on only
one of the dimensions of the input, or, in other words, on only one factor that
parameterises the input space. The neural network implementation of this type
of factorial encoder would have connections from each output neuron to only
half of the input neurons. In figure 2(b) an accurate encoding is obtained by
a process that is akin to triangulation, in which the intersection between the
2 orthogonal encoding cells defines a region of the 2-torus that is equivalent to
the corresponding joint encoding cell in figure 2(a).
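
The neuron counts quoted for the two diagrams can be reproduced with a small sketch. The function names below are illustrative and do not come from the paper; the sketch only assumes that each circular dimension is divided into the same number of cells.

```python
def joint_neuron_count(cells_per_dim: int, dims: int = 2) -> int:
    """Joint encoder: one neuron per cell of the full lattice, k ** d."""
    return cells_per_dim ** dims


def factorial_neuron_count(cells_per_dim: int, dims: int = 2) -> int:
    """Factorial encoder: one neuron per cell of each 1-dimensional factor, k * d."""
    return cells_per_dim * dims


# The 20 x 20 lattice used in the figures.
k = 20
print(joint_neuron_count(k), factorial_neuron_count(k))
```

The gap between the two counts widens rapidly as the number of cells per dimension or the number of factors grows, which is the saving that the text attributes to factorial encoding.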

For a toroidal input manifold it turns out that there is an upper limit to the 
number of neurons that can be used if a factorial encoder is to have a smaller 
average reconstruction error than the corresponding joint encoder. This limit is 
smaller than the number of neurons that are used in figure 2(b), so that diagram 
should not be interpreted too literally. 

1.1 Vector Quantisers 

The existing literature on the simplest type of encoder (i.e. the vector quantiser 
(VQ)) includes the following examples: 

1. A standard VQ, in which the input space is partitioned into a number 
of non-overlapping encoding cells, which is also known as an LBG vector 
quantiser (after the initials of the authors of [5]). In operation, all of the 
input vectors that lie closest (in the Euclidean sense) to a given code vector 
are assigned the same code index (which thus defines an encoding cell), 
and the approximate reconstruction of these inputs is then the centroid
of the encoding cell. This type of VQ can be viewed as a single-layer
winner-take-all (WTA) neural network.

Figure 2: Diagram (a) shows the encoding cells for joint encoding of a
2-dimensional toroidal manifold; a typical encoding cell is shaded. Diagram (b)
shows the corresponding encoding cells for a factorial encoder; typical encoding
cells for each of the two factors and their intersection are shaded. The distortion
that would result from only one of the two factors is not as large as in the case
of the corresponding linear manifold, because the long thin rectangular encoding
cells are now wrapped round into loops, thus reducing the average separation (in
the Euclidean sense) of points within each encoding cell.

2. A topographic VQ (TVQ), in which the code indices and encoding cells are 
arranged so that code indices that differ by a small amount are assigned 
to encoding cells that are close to each other (in the Euclidean sense). 
This topographic property automatically emerges if a VQ is optimised for 
encoding input vectors to be transmitted along a noisy communication 
channel [?]. The Kohonen topographic mapping network [2] is
an approximation to this type of encoder, as was explained in [?]. The
TVQ may be generalised to a soft TVQ (STVQ) in which each code index
is chosen probabilistically in response to the corresponding input vector.

3. The simultaneous use of more than one standard VQ, with each VQ encoding
only a subspace of the input (see for example [?]); in effect, more than
one code index is used to encode the input vector. By this means, a high- 
dimensional space can be split up into a number of lower dimensional 
pieces. This type of VQ is equivalent to multiple single-layer WTA neural 
network modules, each of which operates on a subspace of the input. This 
is an example of a factorial encoder, in which the input is split into a 
number of separate parts, or factors. 

4. The simultaneous use of multiple VQs can be extended to a tree-like network
of VQs [?]. This type of VQ is equivalent to multiple single-layer
WTA neural network modules which are connected together in a tree-like
network of modules.
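
The standard VQ of case 1 can be sketched as follows. This is a minimal illustration of the nearest-code-vector encoding and centroid update described there; it omits the codebook initialisation and splitting steps of a full LBG implementation, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def encode(x: Vector, codebook: Sequence[Vector]) -> int:
    """Winner-take-all encoding: index of the closest code vector."""
    return min(range(len(codebook)), key=lambda y: sq_dist(x, codebook[y]))


def update_codebook(data: Sequence[Vector], codebook: Sequence[Vector]) -> List[Vector]:
    """Move each code vector to the centroid of its encoding cell."""
    dim = len(codebook[0])
    sums = [[0.0] * dim for _ in codebook]
    counts = [0] * len(codebook)
    for x in data:
        y = encode(x, codebook)
        counts[y] += 1
        for i in range(dim):
            sums[y][i] += x[i]
    # An empty cell keeps its previous code vector.
    return [
        tuple(s / counts[y] for s in sums[y]) if counts[y] else tuple(codebook[y])
        for y in range(len(codebook))
    ]


def train(data: Sequence[Vector], codebook: Sequence[Vector], iterations: int = 10) -> List[Vector]:
    """Alternate encoding and centroid updates for a fixed number of passes."""
    current = list(codebook)
    for _ in range(iterations):
        current = update_codebook(data, current)
    return current
```

For example, `train([(0.0,), (0.1,), (1.0,), (1.1,)], [(0.0,), (1.0,)])` moves the two code vectors to the centroids of the two clusters.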

For simplicity, only the case of a 2-layer network (i.e. an input and an 
output layer) will be considered, but otherwise the network will be obliged to 
learn how to make use of all of its neurons. The simplest encoder which has 
all of the required behaviour, and which includes the above 2-layer examples 
as special cases, is one in which the neurons fire discretely in response to the 
input, and, after a finite number of firing events has occurred, the input is then 
reconstructed as accurately as possible (in the Euclidean sense). In the special
case where only a single firing event is observed, this reduces to the standard
LBG vector quantiser that was discussed in case 1 above. In the more general
case, where a finite number of firing events is observed, this can lead to factorial
encoder networks of the type that was discussed in case 3 above.

1.2 Curved Manifolds 

The purpose of this paper is to derive optimal ways of encoding data using 
neural networks in which multiple firing events are observed, and to show that 
factorial encoder networks can be optimal when the input data lies on a curved 
manifold. In order to get a feel for how curved manifolds arise in image data,
consider the examples shown in figure 3 and figure 4, which show the manifold
generated by a single target (figure 3) and by a pair of targets (figure 4), when
projected onto three neighbouring pixels (i.e. the locus of the 3-vector formed
from these pixel values is plotted as the target(s) move around).

Clearly, these image manifolds are curved, and the curvature increases as
the Gaussian profiles used to generate the target images become narrower.

It is not at all obvious how best to encode vectors that lie on such man- 
ifolds. For instance, one might try to tile the manifold with a large number 
of small encoding cells obtained from some variant of a VQ, or one might try 
to project the manifold onto a basis obtained from some variant of principal
components analysis (PCA). In fact, these two examples are both special cases
of the approach that is advocated in this paper; a VQ corresponds to a single 
firing event, whereas PCA corresponds to an infinite number of firing events. 

The problem of optimally encoding data that is derived from a general curved 
manifold requires a numerical solution. However, in order to develop our un- 
derstanding, it is best to start with an analytically tractable example based on 
a simple curved manifold, which is carefully selected to preserve the essential 
features of more general curved manifolds. With this in mind, the most impor- 
tant feature to preserve in the analytic example is curvature. A circle is the 
simplest 1-dimensional curved manifold, which may then be used to construct 
higher dimensional toroidal manifolds. For instance, a pair of circles may be 
used to construct the 2-dimensional toroidal manifold shown in figure 2. It turns
out that, if a toroidal manifold is used, then the network objective function can 
be analytically minimised to yield results that exhibit interesting joint encoder 
and factorial encoder properties. 

Figure 3: Manifold formed when the 1-dimensional image of a target (a Gaussian
profile with a half-width of one pixel) is moved around. Only the projection Aij
onto the pixels at (i,j) = (-1,0), (0,0) and (1,0) is shown.



1.3 Structure of this Paper 

In section 2 the basic theoretical framework is introduced, from which some
expressions are derived for optimising a network which is trained on data from
a toroidal input manifold. In section 3 the detailed results for encoding a
circular input manifold are given (which are trivially related to the corresponding
results for the case of joint encoding of a 2-torus), and in the sections that
follow these results are extended to the case of factorial encoding of a 2-torus,
the results for joint encoding and factorial encoding are compared, some useful
asymptotic approximations are discussed, and a useful approximation to the
optimal network is discussed.

The main steps in the derivations are reported in the appendices to this 
paper, and in several cases there is a considerable amount of algebra involved, 
which was done using algebraic manipulator software [12].

2 Basic Theoretical Framework 

Figure 4: Manifold formed when the 2-dimensional image of a target (a Gaussian
profile with half-widths of one pixel in each direction) is moved around. Only
the projection Aij onto the pixels at (i,j) = (-1,1), (0,0) and (1,1) is shown.

The encoder model that is assumed throughout this paper is a 2-layer network
of neurons. The state of the input layer is denoted as an input vector x (which
is assumed in this paper to be a continuous activity pattern), and the state of
the output layer is denoted as the output vector y (which is assumed in this
paper to be a discrete pattern of firing events). The information content of the
output state y may be used to draw inferences about the input state x. This
can be formalised by using Bayes' theorem in the form

$$\Pr(x|y) = \frac{\Pr(y|x)\,\Pr(x)}{\int dx'\,\Pr(y|x')\,\Pr(x')} \qquad (1)$$


where the PDF Pr (x|y) of the input x given that the output y is known (i.e. 
the generative model) is completely determined by two quantities: the likeli- 
hood Pr (y|x) that output y occurs when input x is present (i.e. the recogni- 
tion model), and the prior PDF Pr (x) that input x could occur irrespective 
of whether y is being observed. However, for all but the most trivial situa- 
tions, if the functional form of Pr(y|x) is simple then the functional form of 
Pr(x|y) is complicated (or vice versa, with the roles of Pr(y|x) and Pr(x|y) 
interchanged). In other words, if the recognition and generative models are 
strictly related by Bayes' theorem, then difficulties inevitably arise in analytic 
and numerical calculations. 
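
As a small illustration of equation 1, the following sketch evaluates the posterior for a discrete input variable, so that the integral in the denominator becomes a sum. The discretisation and the function name are assumptions made purely for illustration.

```python
from typing import List, Sequence


def posterior(prior: Sequence[float], likelihood: Sequence[Sequence[float]], y: int) -> List[float]:
    """Return Pr(x|y) for every discrete input state x.

    prior[x] is Pr(x) and likelihood[x][y] is Pr(y|x).
    """
    joint = [likelihood[x][y] * prior[x] for x in range(len(prior))]
    evidence = sum(joint)  # discrete stand-in for the integral over x'
    return [j / evidence for j in joint]


# Two input states, two output states.
prior = [0.5, 0.5]
likelihood = [[0.9, 0.1], [0.2, 0.8]]
print(posterior(prior, likelihood, 0))
```

Even in this tiny case the posterior mixes every row of the likelihood through the normalising sum, which hints at why a simple recognition model generally implies a complicated generative model.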

A possible way around this problem is to use a network objective function
D_Q that has a simple functional form for Pr(y|x), but has an approximation
to the ideal Pr(x|y) implied by Bayes' theorem (or vice versa). A convenient
choice is

$$D_Q = -\int dx \sum_y \Pr(x,y) \log Q(x,y)
      = -\int dx\, \Pr(x) \sum_y \Pr(y|x) \log Q(x|y) - \sum_y \Pr(y) \log Q(y) \qquad (2)$$

where Pr(x,y) is a joint probability that satisfies Pr(x,y) = Pr(y|x) Pr(x) =
Pr(x|y) Pr(y) (i.e. Bayes' theorem holds), Q(x,y) is an approximation
to Pr(x,y) that satisfies the corresponding relationships Q(x,y) =
Q(y|x) Q(x) = Q(x|y) Q(y), ∫dx Pr(x)(···) integrates over all the possible
states of the input layer, Σ_y Pr(y|x)(···) sums over all the possible states of the
output layer given that the state of the input layer is known, and Σ_y Pr(y)(···)
sums over all the possible states of the output layer.

The objective function D_Q measures the average number of bits required
when the approximate joint probability Q(x,y) is used as a reference to encode
each pair (x,y) drawn randomly from the true joint probability Pr(x,y) [14],
so D_Q belongs to the class of minimum description length (MDL) objective
functions [?]. Strictly speaking, the number of bits depends on the accuracy
with which the continuous-valued x is measured. However, this refinement is
omitted from equation 2 because it does not affect the results in this paper,
provided that the size of the quantisation cells into which x is binned is much
smaller than the scale on which Pr(x|y) and Q(x|y) fluctuate.

The objective function D_Q can be simplified if Q(x,y) is assumed to have
the following properties

$$Q(y) = \text{constant}, \qquad
Q(x|y) = \frac{1}{\left(\sqrt{2\pi}\,\sigma\right)^{\dim x}} \exp\left(-\frac{\|x - x'(y)\|^2}{2\sigma^2}\right) \qquad (3)$$

where the approximation Q(x|y) to the true generative model Pr(x|y) is a
Gaussian PDF, and the prior probabilities Q(y) are constrained to all be equal.
If the value of σ is fixed, then D_Q may be replaced by the simpler, but equivalent,
vector quantiser objective function D_VQ, which is defined as

$$D_{VQ} = \int dx\, \Pr(x) \sum_y \Pr(y|x)\, \|x - x'(y)\|^2 \qquad (4)$$

where Σ_y Pr(y) = 1 has been used to eliminate the Σ_y Pr(y) log Q(y) term.
This measures the average Euclidean distortion that occurs when the input x is 
probabilistically encoded as y, and then subsequently reconstructed as x'(y).
This is a soft version of the LBG vector quantiser objective function [2], in
which y acts as a code index, Pr(y|x) acts as a soft encoding prescription for
probabilistically transforming x into y, and x'(y) acts as the corresponding
code vector. The optimal Pr(y|x) that minimises D_VQ is deterministic (i.e.
each x is transformed to one, and only one, y), so D_VQ actually leads to an
LBG vector quantiser itself, rather than merely a probabilistic version thereof
[?].

Under the same assumptions (see equation 3) that yielded the expression for
D_VQ, the Helmholtz machine objective function [?] would reduce to

$$D_{HM} = D_{VQ} + \int dx\, \Pr(x) \sum_y \Pr(y|x) \log \Pr(y|x) \qquad (5)$$


where the extra term is the so-called "bits-back" term, which is (minus) the
entropy of the output y given that the input x is known, then averaged over
all inputs. Thus D_HM does not directly penalise Pr(y|x) that have a large
entropy, or, in other words, it allows the recognition model Pr(y|x) to be such
that many output states y are permitted once the input state x is known. This
means that the recognition models produced by a Helmholtz machine tend to
be more stochastic than they would have been had the "bits-back" term been
omitted from D_HM. Conversely, the objective function D_VQ that is used in this
paper directly penalises Pr(y|x) that have a large entropy, so the recognition
models produced tend to be more deterministic than the stochastic ones that
the Helmholtz machine would produce under equivalent circumstances. Thus
using D_VQ tends to lead to sparse codes in which few neurons can fire, whereas
using D_HM tends to lead to distributed codes in which many neurons can fire.
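
The contrast between the two objective functions can be made concrete with a toy calculation. The sketch below is illustrative only: it replaces the integral over Pr(x) with an average over a finite list of sample inputs, and the function names and the two-point data set are assumptions rather than anything taken from the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Posterior = Callable[[Point], List[float]]


def d_vq(samples: Sequence[Point], posterior: Posterior, refs: Sequence[Point]) -> float:
    """Sample estimate of equation 4: average soft reconstruction distortion."""
    total = 0.0
    for x in samples:
        p = posterior(x)
        total += sum(
            p[y] * ((x[0] - refs[y][0]) ** 2 + (x[1] - refs[y][1]) ** 2)
            for y in range(len(refs))
        )
    return total / len(samples)


def bits_back(samples: Sequence[Point], posterior: Posterior) -> float:
    """Sample estimate of the extra term in equation 5 (minus the output entropy)."""
    total = 0.0
    for x in samples:
        total += sum(p * math.log(p) for p in posterior(x) if p > 0.0)
    return total / len(samples)


samples = [(1.0, 0.0), (-1.0, 0.0)]
refs = [(1.0, 0.0), (-1.0, 0.0)]


def hard(x: Point) -> List[float]:
    """Deterministic encoder: each input fires exactly one neuron."""
    return [1.0, 0.0] if x[0] > 0 else [0.0, 1.0]


def soft(x: Point) -> List[float]:
    """Maximally stochastic encoder: both neurons are equally likely."""
    return [0.5, 0.5]


for name, enc in (("hard", hard), ("soft", soft)):
    vq = d_vq(samples, enc, refs)
    print(name, vq, vq + bits_back(samples, enc))
```

In this toy case the stochastic encoder is charged its full distortion by D_VQ, whereas the extra term in D_HM gives back log 2 of that cost, which is the sense in which D_HM tolerates high-entropy recognition models.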

The chosen objective function has both an information theoretic interpretation
(given by D_Q in equation 2), in which it seeks to minimise the number
of bits required to encode Pr(x,y), and also an encoder/decoder interpretation
(given by D_VQ in equation 4), in which it seeks to minimise the Euclidean
distortion that arises when x is encoded as y and then subsequently reconstructed
as x'(y). Also, using D_VQ as the network objective function ensures backward
compatibility with preexisting results (e.g. [?]).

An upper bound on the network objective function is introduced in section 2.1,
and the stationarity conditions which must be satisfied for an optimal
network behaviour are derived in section 2.2. Joint encoding on a 2-torus is
discussed in section 2.3, and factorial encoding on a 2-torus is discussed in
section 2.4.

2.1 Objective Function 

In order to make progress it is necessary to make some assumptions about the
network output state y. Thus the output layer will be assumed to consist
of M neurons that fire discretely in response to the input activity pattern x.
Furthermore, y will be assumed to be an n-dimensional vector, that consists
of the observations of the locations (y_1, y_2, ..., y_n) of the first n firing events
that occur in response to input x (this is described in detail in [?]). Note
that the individual y_i are scalars, but the generalisation to vector-valued y_i is
straightforward.

For compatibility with results published earlier (e.g. [?]), the objective
function that will be used here is D = 2 D_VQ, which has an upper bound D_1 + D_2
given by (see appendix A for a detailed derivation and discussion)

$$D_1 = \frac{2}{n} \int dx\, \Pr(x) \sum_{y=1}^{M} \Pr(y|x)\, \|x - x'(y)\|^2$$

$$D_2 = \frac{2(n-1)}{n} \int dx\, \Pr(x) \left\| x - \sum_{y=1}^{M} \Pr(y|x)\, x'(y) \right\|^2 \qquad (6)$$

where Pr(y|x) is the probability that neuron y fires first in response to input
x, and x'(y) is a reference vector that is used by neuron y in its attempt to
approximately reconstruct the input. In the limit n = 1 only D_1 contributes,
and a standard LBG vector quantiser emerges when D_1 is minimised. As n → ∞
only D_2 contributes, and a PCA encoder emerges when D_2 is minimised.

This upper bound D_1 + D_2 on the objective function D will be used to derive
all of the results in this paper. Its functional form, in which Pr(y|x) appears
only quadratically (unlike in equation 5 for D_HM), allows analytic results to be
readily derived.
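
The two terms of equation 6 are straightforward to evaluate numerically. The sketch below assumes, purely for illustration, that the input is uniformly distributed on the unit circle and approximates the integral by a midpoint sum over the angle; the function name, its arguments and the grid size are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def upper_bound_terms(
    posterior: Callable[[float], List[float]],
    refs: Sequence[Tuple[float, float]],
    n: int,
    grid: int = 2000,
) -> Tuple[float, float]:
    """Return (D_1, D_2) of equation 6 for inputs uniform on the unit circle.

    posterior(theta) returns the M probabilities Pr(y|x) at x = (cos theta, sin theta),
    refs holds the M reference vectors x'(y), and n is the number of firing events.
    """
    d1 = 0.0
    d2 = 0.0
    m = len(refs)
    for k in range(grid):
        theta = 2.0 * math.pi * (k + 0.5) / grid
        x = (math.cos(theta), math.sin(theta))
        p = posterior(theta)
        # Average reconstruction vector sum_y Pr(y|x) x'(y).
        mean = [sum(p[y] * refs[y][i] for y in range(m)) for i in range(2)]
        d1 += sum(
            p[y] * ((x[0] - refs[y][0]) ** 2 + (x[1] - refs[y][1]) ** 2)
            for y in range(m)
        )
        d2 += (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
    return 2.0 * d1 / (n * grid), 2.0 * (n - 1) * d2 / (n * grid)
```

With a single neuron whose reference vector sits at the origin, every input has unit distortion, so the call `upper_bound_terms(lambda theta: [1.0], [(0.0, 0.0)], n)` splits a total of 2 between the two terms in the ratio 1 : (n - 1), and only D_1 survives when n = 1.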



2.2 Stationarity Conditions 

The upper bound D_1 + D_2 (see equation 6) on the objective function D = 2 D_VQ
(see equation 4) needs to be minimised with respect to two types of parameter:
posterior probabilities Pr(y|x) and reference vectors x'(y). This could be done
numerically for an arbitrary input PDF Pr(x) by using a gradient descent type
of algorithm [?], but here D_1 + D_2 will be analytically minimised for some
carefully chosen special cases of Pr(x).

The stationarity condition ∂(D_1 + D_2)/∂x'(y) = 0 gives (see appendix B.1)

$$n \int dx\, \Pr(x|y)\, x = x'(y) + (n-1) \int dx\, \Pr(x|y) \sum_{y'=1}^{M} \Pr(y'|x)\, x'(y') \qquad (7)$$

where Pr(y) > 0 has been assumed. The ∂(D_1 + D_2)/∂x'(y) = 0 stationarity condition
also has the solution Pr(y) = 0, but this solution may be discarded because
Pr(y) > 0 is always the case in practice. The right hand side of the stationarity
condition in equation 7 has two contributions: a D_1-like contribution which
is a single reference vector x'(y), plus a D_2-like contribution which is n - 1
times a sum of reference vectors Σ_{y'=1}^{M} (∫dx Pr(y'|x) Pr(x|y)) x'(y'), where
the coefficient ∫dx Pr(y'|x) Pr(x|y) accounts for the effect (at neuron y) of
observing all pairs of firing events (y, y') for y' = 1, 2, ..., M. The sum of these
two terms is n times the total reference vector that is effectively associated with
neuron y, which is n times ∫dx Pr(x|y) x as given on the left hand side of
equation 7.
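
In the special case n = 1 the second term on the right hand side of equation 7 vanishes, and the condition reduces to x'(y) = ∫dx Pr(x|y) x, i.e. each reference vector is the centroid of the inputs that cause its neuron to fire. The sketch below checks this numerically under an assumed configuration that is chosen only for illustration: inputs uniform on the unit circle, partitioned into M equal arcs with deterministic Pr(y|x). The centroid of an arc of half-angle π/M lies at radius sin(π/M)/(π/M).

```python
import math


def arc_centroid_radius(M: int, grid: int = 10000) -> float:
    """Numerically average x over the arc |theta| <= pi/M (the cell of one neuron).

    Only the cosine component is accumulated; the sine component
    averages to zero by symmetry about theta = 0.
    """
    half = math.pi / M
    total = 0.0
    for k in range(grid):
        theta = -half + 2.0 * half * (k + 0.5) / grid
        total += math.cos(theta)
    return total / grid


def closed_form_radius(M: int) -> float:
    """Centroid radius of a circular arc of half-angle pi/M on the unit circle."""
    return math.sin(math.pi / M) / (math.pi / M)


for M in (4, 8, 16):
    print(M, arc_centroid_radius(M), closed_form_radius(M))
```

The reference vectors therefore lie strictly inside the unit circle and approach it as M grows.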

The stationarity condition ∂(D_1 + D_2)/∂ log Pr(y|x) = 0 gives (see appendix B.2)

$$\sum_{y'=1}^{M} \left( \Pr(y'|x) - \delta_{y,y'} \right) x'(y') \cdot \left( \frac{1}{2}\, x'(y') - n\,x + (n-1) \sum_{y''=1}^{M} \Pr(y''|x)\, x'(y'') \right) = 0 \qquad (8)$$

where the constraint Σ_{y'=1}^{M} Pr(y'|x) = 1 has been imposed, and Pr(x) > 0 and
Pr(y|x) > 0 have been assumed. The ∂(D_1 + D_2)/∂ log Pr(y|x) = 0 stationarity
condition also has two other solutions: either Pr(x) = 0, or Pr(x) > 0 and
Pr(y|x) = 0. Using the normalisation constraint Σ_{y'=1}^{M} Pr(y'|x) = 1, the last
of these solutions ensures that Pr(y'|x) ≤ 1 for y' ≠ y, and when all values of y
are considered the net effect is to constrain Pr(y|x) to the interval
0 ≤ Pr(y|x) ≤ 1, as expected.

The solutions of the stationarity condition for Pr(y|x) in equation 8 are
piecewise linear functions of x. This piecewise linear property of Pr(y|x) (as
discussed in appendix B.2) is an enormous simplification, because it means that
rather than searching the infinite dimensional space of functions Pr(y|x) for
the optimal ones that minimise D_1 + D_2, one needs only search a finite
dimensional space of piecewise linear functions Pr(y|x) (subject to the constraints
0 ≤ Pr(y|x) ≤ 1 and Σ_{y=1}^{M} Pr(y|x) = 1).

2.3 Joint Encoding 

Joint encoding, as shown in figure 2(a), is characterised by a Pr(y|x) in which
the neurons labelled by y form a discretised version of the manifold that x lives
on. For instance, when x lives on a 2-torus, so that x = (x_1, x_2) where x_1 =
(cos θ_1, sin θ_1) and x_2 = (cos θ_2, sin θ_2), where 0 ≤ θ_1 ≤ 2π and 0 ≤ θ_2 ≤ 2π,
the Pr(y|x) typically behave as shown in figure 2(a), where the 2-torus is tiled
with encoding cells. When n > 1 neighbouring encoding cells overlap, so figure
2(a) does not then give an accurate representation of the encoding cells.
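
The embedding of the 2-torus and the tiling by joint encoding cells can be sketched as follows. The hard cell assignment shown here corresponds only to non-overlapping cells, and the placement of the cell boundaries and the function names are illustrative assumptions.

```python
import math
from typing import Tuple


def torus_point(theta1: float, theta2: float) -> Tuple[float, float, float, float]:
    """Embed the angle pair (theta1, theta2) as x = (x_1, x_2) in 4 dimensions."""
    return (math.cos(theta1), math.sin(theta1), math.cos(theta2), math.sin(theta2))


def joint_cell(theta1: float, theta2: float, side: int) -> Tuple[int, int]:
    """Index pair (y_1, y_2) of the lattice cell containing the point.

    The lattice has side x side cells and indices run from 1 to side.
    """
    two_pi = 2.0 * math.pi
    width = two_pi / side
    y1 = int((theta1 % two_pi) // width) % side + 1
    y2 = int((theta2 % two_pi) // width) % side + 1
    return (y1, y2)


x = torus_point(0.3, 1.2)
print(x, joint_cell(0.3, 1.2, 20))
```

Each circular factor contributes unit squared length, so every point of this embedded torus has squared norm 2.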

For joint encoding of a 2-torus, y must be replaced by the pair (y_1, y_2), where
the y_1 index labels one direction around the toroidal lattice, and y_2 labels the
other direction (this notation must not be confused with the (y_1, y_2, ..., y_n)
notation that was used in section 2.1). Thus Pr(y|x) → Pr(y_1, y_2|x_1, x_2) with
1 ≤ y_1 ≤ √M and 1 ≤ y_2 ≤ √M. For simplicity, assume Pr(x_1, x_2) =
Pr(x_1) Pr(x_2), where Pr(x_1) and Pr(x_2) each define a uniform PDF on the
input manifold. The following results for D_1 and D_2 may then be derived (see
appendix C.1)



$$D_1 = \frac{4}{n} \int dx_1\, \Pr(x_1) \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|x_1)\, \|x_1 - x_1'(y_1)\|^2$$

$$D_2 = \frac{4(n-1)}{n} \int dx_1\, \Pr(x_1) \left\| x_1 - \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|x_1)\, x_1'(y_1) \right\|^2 \qquad (9)$$

These results for D_1 and D_2 show that, under the simplifying assumptions
made above, the problem of optimising a joint encoder is equivalent to the
problem of optimising an encoder for x_1 alone (with the replacement M →
√M), and then multiplying the value of D_1 + D_2 by a factor 2 to account for x_2
as well. This illustration of the behaviour of joint encoder posterior probabilities
in the case of Pr(y_1, y_2|x_1, x_2) may readily be generalised to higher dimensions.
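
The replacement rule can be illustrated in the simplest case n = 1, where only D_1 contributes. The sketch below assumes a particular stationary LBG configuration on the circle (M equal arcs with centroid reference vectors of radius r = sin(π/M)/(π/M)), for which D_1 = 2(1 - r^2); this closed form and the function names are worked out here for illustration and are not quoted from the paper.

```python
import math


def circle_lbg_distortion(M: int) -> float:
    """D_1 at n = 1 on the unit circle, for M equal arcs with centroid reference vectors."""
    r = math.sin(math.pi / M) / (math.pi / M)  # centroid radius of each arc
    return 2.0 * (1.0 - r * r)


def joint_torus_distortion(M: int) -> float:
    """Joint encoding of a 2-torus with M neurons: apply M -> sqrt(M), then double."""
    side = math.isqrt(M)
    if side * side != M:
        raise ValueError("M must be a perfect square for a sqrt(M) x sqrt(M) lattice")
    return 2.0 * circle_lbg_distortion(side)


for M in (16, 100, 400):
    print(M, joint_torus_distortion(M))
```

Because only √M neurons are available per circle, the joint encoder's distortion falls slowly with M, which is the cost that a factorial encoder tries to avoid.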



2.4 Factorial Encoding 

Factorial encoding, as shown in figure 2(b), is characterised by a Pr(y|x) in
which the neurons labelled by y are partitioned into a number of subsets, each
of which forms a discretised version of a subspace of the manifold that x lives
on. For instance, when x lives on a 2-torus, and the neurons are partitioned
into two equal-sized subsets, the Pr(y|x) typically behave as shown in figure
2(b), where each of the two circular subspaces within the 2-torus is tiled with
encoding cells, which overlap when n > 1.

For factorial encoding of a 2-torus Pr(y|x) = Pr(y|x_1, x_2) = (1/2) Pr(y|x_1) +
(1/2) Pr(y|x_2), where Σ_{y=1}^{M/2} Pr(y|x_1) = 1, Σ_{y=M/2+1}^{M} Pr(y|x_2) = 1,
Pr(y|x_1) = 0 for M/2 + 1 ≤ y ≤ M, and Pr(y|x_2) = 0 for 1 ≤ y ≤ M/2. For
simplicity, assume Pr(x_1, x_2) = Pr(x_1) Pr(x_2), where Pr(x_1) and Pr(x_2) each
define a uniform PDF on the input manifold. The following results for D_1 and
D_2 may then be derived (see appendix C.2)




These results for D_1 and D_2 show that, under the simplifying assumptions
made above, the problem of optimising a factorial encoder is closely related to
the problem of optimising two 1-dimensional encoders. This illustration of the
behaviour of factorial encoder posterior probabilities in the case of Pr(y|x_1, x_2)
may readily be generalised to higher dimensions.
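
The structure of the factorial posterior described above can be sketched directly: the first M/2 neurons respond only to x_1, the last M/2 only to x_2, and each half carries half of the total probability. The function name and the example numbers are illustrative.

```python
from typing import List, Sequence


def factorial_posterior(p_given_x1: Sequence[float], p_given_x2: Sequence[float]) -> List[float]:
    """Combine two normalised half-posteriors into Pr(y|x_1, x_2) over all M neurons.

    p_given_x1 holds Pr(y|x_1) for the first M/2 neurons and
    p_given_x2 holds Pr(y|x_2) for the last M/2 neurons.
    """
    if len(p_given_x1) != len(p_given_x2):
        raise ValueError("the two subsets of neurons must have equal size")
    return [0.5 * p for p in p_given_x1] + [0.5 * p for p in p_given_x2]


post = factorial_posterior([0.7, 0.3, 0.0], [0.0, 0.4, 0.6])
print(post, sum(post))
```

Because each half is normalised separately, the combined posterior is automatically normalised over all M neurons.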

3 Circular Manifold 



The analysis of how to encode data that lives on a curved manifold begins with
the case of data that lives on a circle. In particular, assume that the input
vector x is uniformly distributed on the unit circle centred on the origin, so
that x can be parameterised by a single angular variable θ, thus

$$x = (\cos\theta, \sin\theta), \qquad \int dx\, \Pr(x)\,(\cdots) = \frac{1}{2\pi} \int_0^{2\pi} d\theta\,(\cdots) \qquad (11)$$

The posterior probability Pr(y|x) may thus be replaced by Pr(y|θ), and for
purely conventional reasons, the range of y is now chosen to be y = 0, 1, ..., M - 1
rather than y = 1, 2, ..., M. The set of M posterior probabilities for y =
0, 1, ..., M - 1 can be parameterised as

$$\Pr(y|\theta) = p\left(\theta - \frac{2\pi y}{M}\right) \qquad (12)$$

where p(θ) is the θ-dependence of the posterior probability associated with the
y = 0 neuron. The θ-dependence of p(θ) must be piecewise sinusoidal (i.e. made
out of pieces that each have the functional form a + b cos θ + c sin θ) in order to
ensure that Pr(y|x) is piecewise linear, as is required of solutions to equation
8. Similarly, the M corresponding reference vectors can be parameterised as

$$x'(y) = r \left( \cos\frac{2\pi y}{M}, \sin\frac{2\pi y}{M} \right) \qquad (13)$$

which all have length r, and thus form a regular M-sided polygon.
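
A short sketch of the reference vector parameterisation follows. It simply places M vectors of length r at angular spacing 2π/M, as the text describes; the function name is illustrative.

```python
import math
from typing import List, Tuple


def reference_vectors(M: int, r: float) -> List[Tuple[float, float]]:
    """Reference vectors x'(y), y = 0..M-1, at the vertices of a regular M-sided polygon."""
    return [
        (r * math.cos(2.0 * math.pi * y / M), r * math.sin(2.0 * math.pi * y / M))
        for y in range(M)
    ]


refs = reference_vectors(6, 0.9)
side = math.hypot(refs[1][0] - refs[0][0], refs[1][1] - refs[0][1])
print(refs[0], side)
```

Adjacent vertices are separated by a chord of length 2 r sin(π/M), so for the hexagon above the side length equals r.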

It turns out that, for input vectors that live on a circular manifold, optimal
joint encoding never causes more than 3 different neurons to fire in response to a
given input (i.e. no more than 3 posterior probabilities overlap in input space).
This severely limits the number of different piecewise functions that have to be
manipulated when solving the D_1 + D_2 minimisation problem for input vectors
that live on a circle. An analogous simplification also holds for joint and factorial
encoding of a 2-torus. The case of 2 overlapping posterior probabilities can be
optimised without too much difficulty, but the case of 3 overlapping posterior
probabilities involves a prohibitively large amount of algebra, for which it is
convenient to use an algebraic manipulator [12]. The calculations turn out to
be highly structured, so an algebraic manipulator could in principle
be used to solve even more complicated analytic problems.

All of the results for encoding input data that lives on a circular manifold may 
be derived from the expression for Di+D 2 in equation|£](and the corresponding 
stationarity conditions), with the replacement given in equation El to ensure 
that the input manifold corresponds to a uniform distribution of data around a 
unit circle, and the functional forms given in equation El an d equation 1131 

The corresponding results for joint encoding of data that lives on a 2-torus 
can be obtained directly from these results (see section l2~3l) . The expression for 
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the minimum value of D\ + D2 for joint encoding a 2-torus using \[M x \[M 
neurons is obtained by making the replacement M — ■> VM in the expression for 
the minimum value of D\ +D2 for encoding a circle using M neurons, and then 
multiplying this result by 2 in order to account for both the circles that form 
the 2-torus (see equation EJ. 
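The replacement rule can be sketched as a small wrapper around any circle result. As a stand-in for the general circle expression, the example uses the n = 1 closed form (twice the mean squared distance from an arc of angular length 2π/M to its centroid); the function names are hypothetical.

```python
import math

def d_circle_n1(M):
    """D1 + D2 for a circle at n = 1 (vector quantiser limit)."""
    r = (M / math.pi) * math.sin(math.pi / M)   # centroid radius of the arc
    return 2.0 * (1.0 - r * r)                  # twice the mean squared distance

def d_joint_torus(d_circle, M):
    """Joint encoding of a 2-torus with sqrt(M) x sqrt(M) neurons:
    substitute M -> sqrt(M) in the circle result, then multiply by 2."""
    return 2.0 * d_circle(math.sqrt(M))

value = d_joint_torus(d_circle_n1, 16)   # a 4 x 4 arrangement of neurons
```

Any other circle expression with the same call signature can be passed in place of `d_circle_n1`.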



3.1 Two Overlapping Posterior Probabilities 

A detailed derivation of the results reported in this section is given in appendix
D.1. Because the neurons have an angular separation of $2\pi/M$ (see the form of the
posterior probability given earlier), the functional form of $p(\theta)$ may be
defined as

$$p(\theta) = \begin{cases} 1 & 0 \le |\theta| \le \frac{\pi}{M} - s \\ f(\theta) & \frac{\pi}{M} - s \le |\theta| \le \frac{\pi}{M} + s \\ 0 & |\theta| \ge \frac{\pi}{M} + s \end{cases} \tag{14}$$

where the $s$ parameter is half the angular width of the overlap between the
posterior probabilities of adjacent neurons on the unit circle, in which case
$0 \le s \le \pi/M$ ensures that no more than two neurons can respond to a given
input. Anticipating the optimum solution, a typical example of this type of
posterior probability is shown in figure 5.

In order to guarantee that $\Pr(y|\mathbf{x})$ has a piecewise linear dependence on $\mathbf{x}$, as
is required of solutions of the stationarity condition on $\Pr(y|\mathbf{x})$, $f(\theta)$ must
have the sinusoidal dependence $f(\theta) = a + b\cos\theta + c\sin|\theta|$, where the use of
$|\theta|$ arises because $p(\theta) = p(-\theta)$. Note that the $\Pr(\mathbf{x}) = 0$ solution to the
stationarity condition on $\Pr(y|\mathbf{x})$ implies that $\Pr(y|\mathbf{x})$ is undefined for any
$\mathbf{x}$ that does not lie on the unit circle. However, for those $\mathbf{x}$ that do lie on
the unit circle, the $a$, $b$ and $c$ parameters can be determined by demanding
continuity of $p(\theta)$ at the ends of its piecewise intervals (i.e. at
$\theta = \pi/M - s$ and $\theta = \pi/M + s$), and by demanding that the total probability
of any neuron firing first is unity (i.e. the total posterior probability is
normalised such that $f(\theta) + f(2\pi/M - \theta) = 1$ in the interval
$\pi/M - s \le \theta \le \pi/M + s$), to obtain

$$f(\theta) = \frac{1}{2} + \frac{1}{2}\,\frac{\sin\left(\frac{\pi}{M} - \theta\right)}{\sin s} \tag{15}$$

This corresponds to a piecewise linear contribution to $\Pr(y|\mathbf{x})$ whose gradient
points in the $(-\sin(\pi/M), \cos(\pi/M))$ direction. A typical example of this type of
posterior probability is shown in figure 5.
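The continuity and normalisation conditions can be checked numerically with a short sketch. It assumes the overlap piece f(θ) = 1/2 + sin(π/M − |θ|) / (2 sin s), which is our reading of equation 15, and uses the values M = 8 and s = 0.49 π/M quoted in the caption of figure 5; the helper names are hypothetical.

```python
import math

def posterior(theta, M, s):
    """Two-overlap posterior p(theta) of the neuron centred at theta = 0."""
    t, half = abs(theta), math.pi / M
    if t <= half - s:
        return 1.0                      # only this neuron responds
    if t >= half + s:
        return 0.0                      # this neuron is silent
    return 0.5 + 0.5 * math.sin(half - t) / math.sin(s)   # overlap piece

def wrap(angle):
    """Map an angle to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def total_probability(theta, M, s):
    """Sum of the posteriors of all M neurons at input angle theta."""
    return sum(posterior(wrap(theta - 2 * math.pi * y / M), M, s) for y in range(M))

M = 8
s = 0.49 * math.pi / M
totals = [total_probability(-math.pi + 2 * math.pi * k / 200, M, s) for k in range(200)]
```

With these assumptions every entry of `totals` should equal 1 up to rounding, and `posterior` is continuous at both ends of the overlap interval.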

Without loss of generality (because the solution is symmetric under rotations 
of 9 which are multiples of |j) set y = in equation to obtain in the interval 

=r 



esc 2 s sin ( — ) sin ( 9 ) (sin s — sin ( 9 , 

\mJ \m A Km J 

< (n sin s — (n — 1) r sin ) (16) 
Figure 5: Plot of the optimal neural posterior probability $p(\theta)$ for $M = 8$ and
$n = 2$. The neighbouring posterior probabilities $p(\theta \pm 2\pi/M)$ are also plotted. The
optimal value of $s$ is $s \approx 0.49\,\pi/M$. The departure of $p(\theta)$ from linearity in the
interval $\pi/M - s \le \theta \le \pi/M + s$ is too small to be easily seen.



which may be solved for the optimum length $r$ of the reference vectors, to yield

$$r = \frac{n}{n-1}\,\frac{\sin s}{\sin\left(\frac{\pi}{M}\right)} \tag{17}$$

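Imposing the stationarity condition on the reference vectors then gives a
transcendental equation that must be satisfied by the optimum $s$

$$\frac{\sin s}{\sin\left(\frac{\pi}{M}\right)} - \frac{n-1}{n}\,\frac{M}{\pi}\,\sin\left(\frac{\pi}{M}\right)\left(\cos s + s\sin s\right) = 0 \tag{18}$$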



The symmetry of the solution may be used to make the replacement
$\int_0^{2\pi} d\theta\,(\cdots) \to M\int_0^{2\pi/M} d\theta\,(\cdots)$ in the expressions for $D_1$ and $D_2$, which may
then be evaluated and simplified to yield the minimum $D_1 + D_2$ as

$$D_1 + D_2 = 2 - \frac{n}{n-1}\,\frac{M}{2\pi}\,\left(2s + \sin(2s)\right) \tag{19}$$



The value of $s$ which should be used in this expression for $D_1 + D_2$ is the solution
of equation 18 for the chosen values of $M$ and $n$.
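The numerical procedure can be sketched as follows. The sketch assumes our reading of the two-overlap results: the transcendental condition sin s / sin(π/M) = ((n−1)/n)(M/π) sin(π/M)(cos s + s sin s), the radius r = n sin s / ((n−1) sin(π/M)), and the minimum D1 + D2 = 2 − (n/(n−1))(M/2π)(2s + sin 2s). The function names and the use of bisection are illustrative choices, not taken from the paper.

```python
import math

def solve_s(M, n, tol=1e-12):
    """Bisection for the optimum s on [0, pi/M]; valid for n > 1 in the two-overlap regime."""
    k = (n - 1) / n * (M / math.pi) * math.sin(math.pi / M)

    def g(s):
        return math.sin(s) / math.sin(math.pi / M) - k * (math.cos(s) + s * math.sin(s))

    lo, hi = 0.0, math.pi / M          # g(lo) < 0 whenever n > 1
    if g(hi) < 0:
        raise ValueError("root lies beyond pi/M: three posteriors overlap")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def circle_optimum(M, n):
    """Return (s, r, D1 + D2) for a circular manifold under the stated assumptions."""
    s = solve_s(M, n)
    r = n / (n - 1) * math.sin(s) / math.sin(math.pi / M)
    d = 2.0 - n / (n - 1) * (M / (2 * math.pi)) * (2 * s + math.sin(2 * s))
    return s, r, d

s_opt, r_opt, d_opt = circle_optimum(8, 2)
ratio = s_opt / (math.pi / 8)   # to be compared with the value 0.49 in the figure 5 caption
```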

Note that the expression for $r$ in equation 17 and the expression for $D_1 + D_2$
in equation 19 both have finite limits as $n \to 1$, because the limiting behaviour
of the solution $s$ of equation 18 is $s \to (n-1)\frac{M}{\pi}\sin^2\left(\frac{\pi}{M}\right)$ (see the asymptotic
results in section 6), which contains a factor $n - 1$ to cancel the $\frac{1}{n-1}$ factor that
appears in both equation 17 and equation 19.



3.2 Three Overlapping Posterior Probabilities

A detailed derivation of the results reported in this section is given in appendix
D.2. Because the neurons have an angular separation of $2\pi/M$, the functional form
of $p(\theta)$ may be defined as

$$p(\theta) = \begin{cases} f_1(\theta) & 0 \le |\theta| \le -\frac{\pi}{M} + s \\ f_2(\theta) & -\frac{\pi}{M} + s \le |\theta| \le \frac{3\pi}{M} - s \\ f_3(\theta) & \frac{3\pi}{M} - s \le |\theta| \le \frac{\pi}{M} + s \\ 0 & |\theta| \ge \frac{\pi}{M} + s \end{cases} \tag{20}$$

where the $s$ parameter is half the angular width of the overlap between the
posterior probabilities of adjacent neurons on the unit circle, in which case
$\pi/M \le s \le 2\pi/M$ ensures that no more than 3 neurons can respond to a given input.
Anticipating the optimum solution, a typical example of this type of posterior
probability is shown in figure 6.

In order to guarantee that $\Pr(y|\mathbf{x})$ has a piecewise linear dependence on $\mathbf{x}$,
the $f_i(\theta)$ must have the sinusoidal dependence $f_i(\theta) = a_i + b_i\cos\theta + c_i\sin|\theta|$ for
$i = 1, 2, 3$. For those $\mathbf{x}$ that lie on the unit circle, the $a_i$, $b_i$ and $c_i$ parameters
can be determined by imposing continuity of $p(\theta)$ at $\theta = -\pi/M + s$, $\theta = 3\pi/M - s$
and $\theta = \pi/M + s$, and normalisation of the total posterior probability such that
$f_1(\theta) + f_3(2\pi/M + \theta) + f_3(2\pi/M - \theta) = 1$ in the interval $0 \le \theta \le -\pi/M + s$, and
$f_2(\theta) + f_2(2\pi/M - \theta) = 1$ in the interval $-\pi/M + s \le \theta \le 3\pi/M - s$. Also, to satisfy
the stationarity conditions, impose the stationarity condition on the reference
vectors, and also the stationarity condition on $\Pr(y|\mathbf{x})$ in each of the intervals
$0 \le \theta \le -\pi/M + s$, $-\pi/M + s \le \theta \le 3\pi/M - s$ and $3\pi/M - s \le \theta \le \pi/M + s$. These
conditions are sufficient to solve for the optimum $f_i(\theta)$ for $i = 1, 2, 3$, the
optimum $r$, and the optimum $s$.

The optimum fi (9) are 

fi (9) = — | cos ( s | + coss — 2cos ( — ^ cos 6* | esc 2 ( — ^ sec ( — — s | 

»-i"'(i)(-g-')-(s-)- 1 ) < 21 » 

which correspond to different piecewise linear contributions to Pr(y|x). The 
/i (9) piece has a gradient that points in the (1,0) direction, the fi (9) piece 
has a gradient that points in the (— sin (4jJ , cos (jt)) direction, and the /a (9) 
piece has a gradient that points in the (— sin ,cos(||)) direction. The 
optimum r is 

$$r = \frac{n}{n-1}\,\frac{\cos\left(\frac{2\pi}{M} - s\right)}{\cos\left(\frac{\pi}{M}\right)} \tag{22}$$

and the transcendental equation that must be satisfied by the optimum $s$ (for
$M = 4$ this reduces to equation 18) is

lCOs(|y-s) n-lM / 7T \ / /2?r 

n cos(^) n tt \MJ \ \M 
and the minimum D\ + D2 may be obtained as 



2tt 



2tt 
M 



s 
(23) 



D l + D 2 = 



n ((n- 1) (22 



)) 



2(n- 1) 

((n-l)(2-f .s)+sec 2 (t)) (to 



2(n- l) z 



(24) 



As in section 3.1 the limit $n \to 1$ is well behaved because the limiting
behaviour of the solution $s$ of equation 23 contains a factor $n - 1$ (see the
asymptotic results in section 6) to cancel the $\frac{1}{n-1}$ factor that appears in both
equation 22 and equation 24.

Figure 6: Plot of the optimal neural posterior probability $p(\theta)$ for $M = 8$ and
$n = 100$. The neighbouring posterior probabilities $p(\theta \pm 2\pi/M)$ are also plotted.
The optimal value of $s$ is $s \approx 1.39\,\pi/M$.



The results for the optimum value of $s$ (i.e. equation 18 and equation 23)
may be combined to yield the results shown in figure 7.

Asymptotically, as $M \to \infty$ and $n \to \infty$, the contour $s = \pi/M$ (the dashed line
in figure 7), which is the boundary between the regions where 2 and 3 posterior
probabilities overlap, is given by $n \approx 3M^2/\pi^2$ (see the asymptotic results in
section 6).

Figure 7: Contour plot of the optimum value of $s$ versus $(n, M)$ for encoding
of a circular manifold. The solid contours are for the interval $0 \le s \le \pi/M$, the
dotted contours are for $\pi/M \le s \le 2\pi/M$, and the dashed contour is for $s = \pi/M$
(this behaves asymptotically as $n \approx 3M^2/\pi^2$). The contours are all separated by
a fixed interval in $s$.



The corresponding results for joint encoding of input vectors that live on a 
2-torus are shown in figure 8.

4 Toroidal Manifold: Factorial Encoding 

All of the results for factorial encoding of input data that lives on a toroidal 
manifold may be derived from the expression for D\ + D2 in equation Ei| (and 
the corresponding stationarity conditions), with the appropriate replacements 
for equations El El and 

The posterior probability $p(\theta)$ then has the same functional form as for a
circular manifold, except that $M$ is replaced by $M/2$ because each of the two
dimensions uses exactly half of the total of $M$ neurons, so these results are not
quoted explicitly here. The steps in the derivation of the optimum values of $r$
and $s$ and the minimum value of $D_1 + D_2$ are analogous to the steps that appear
in the derivation for a circular input manifold, and the results are sufficiently
different from the ones that were obtained from a circular manifold that they
are quoted explicitly here.

Figure 8: Contour plot of the optimum value of $s$ versus $(n, M)$ for joint
encoding of a toroidal manifold. The solid contours are for the interval
$0 \le s \le \pi/\sqrt{M}$, the dotted contours are for $\pi/\sqrt{M} \le s \le 2\pi/\sqrt{M}$, and the dashed
contour is for $s = \pi/\sqrt{M}$ (this behaves asymptotically as $n \approx 3M/\pi^2$). The
contours are all separated by a fixed interval in $s$.

4.1 Two Overlapping Posterior Probabilities

A detailed derivation of the results reported in this section is given in appendix
D.3. The stationarity conditions yield the optimum $r$ as

$$r = \frac{2n}{n-1}\,\frac{\sin s}{\sin\left(\frac{2\pi}{M}\right)} \tag{25}$$

The transcendental equation that must be satisfied by the optimum $s$ is

$$\frac{\sin s}{\sin\left(\frac{2\pi}{M}\right)} - \frac{n-1}{n+1}\,\frac{M}{2\pi}\,\sin\left(\frac{2\pi}{M}\right)\left(\cos s + s\sin s\right) = 0 \tag{26}$$

The expression for the minimum $D_1 + D_2$ is

$$D_1 + D_2 = 4 - \frac{n}{n-1}\,\frac{M}{2\pi}\,\left(2s + \sin(2s)\right) \tag{27}$$
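A numerical sketch of the factorial two-overlap solution follows. It assumes our reading of the three results of this section: the condition sin s / sin(2π/M) = ((n−1)/(n+1))(M/2π) sin(2π/M)(cos s + s sin s), the radius r = 2n sin s / ((n−1) sin(2π/M)), and D1 + D2 = 4 − (n/(n−1))(M/2π)(2s + sin 2s). The function name and the bisection solver are illustrative.

```python
import math

def factorial_two_overlap(M, n, tol=1e-12):
    """Return (s, r, D1 + D2) for factorial encoding of a 2-torus, n > 1, two-overlap regime."""
    a = 2 * math.pi / M                       # half-spacing pi/(M/2) of each sub-network
    k = (n - 1) / (n + 1) * math.sin(a) / a   # equals (n-1)/(n+1) * M/(2 pi) * sin(2 pi/M)

    def g(s):
        return math.sin(s) / math.sin(a) - k * (math.cos(s) + s * math.sin(s))

    lo, hi = 0.0, a
    if g(hi) < 0:
        raise ValueError("root lies beyond 2*pi/M: three posteriors overlap")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    s = 0.5 * (lo + hi)
    r = 2 * n / (n - 1) * math.sin(s) / math.sin(a)
    d = 4.0 - n / (n - 1) * (M / (2 * math.pi)) * (2 * s + math.sin(2 * s))
    return s, r, d

s_f, r_f, d_f = factorial_two_overlap(8, 2)
# Near n = 1 the radius should approach the centroid radius of an arc of angle 4*pi/M.
_, r_limit, _ = factorial_two_overlap(8, 1.000001)
```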
4.2 Three Overlapping Posterior Probabilities

A detailed derivation of the results reported in this section is given in appendix
D.4. The stationarity conditions yield the optimum $r$ as

$$r = \frac{2n}{n-1}\,\frac{\cos\left(\frac{4\pi}{M} - s\right)}{\cos\left(\frac{2\pi}{M}\right)} \tag{28}$$



The transcendental equation that must be satisfied by the optimum s is 

lcos(fe-s) n-lM /2tt\ / /4vr \ /4tt \ /4tt 

cos — sin — — s — — — s \ cos — — s 



n cos(ff) 2n 2vr \M J \ \M J \M J \M 

(29) 

The expression for the minimum D\ + D% is 



M 

/ ,' II// — I ) I _1 - - 



„((„-!) (2==* -|L s )-2sec^)) 



(n-l) 2 

cos (30) 



(ra-1) 2 V M 

The results for the optimum value of $s$ (i.e. equation 26 and equation 29)
may be combined to yield the results shown in figure 9.



5 Joint Versus Factorial Encoding 

The results in section 3 and section 4 may be used to deduce when a factorial
encoder is favoured with respect to a joint encoder (for input data that lives on
a 2-torus). Firstly, equation 18 (with the replacement $M \to \sqrt{M}$, and setting
$s = \pi/\sqrt{M}$) may be used to deduce the region of the $(n, M)$ plane where joint
encoding of a 2-torus involves no more than 2 overlapping posterior probabilities,
and equation 26 (with $s = 2\pi/M$) may be used to deduce the corresponding result
for factorial encoding of a 2-torus. Once these regions have been established, it
is then possible to decide which of equation 19 or equation 24 (with $M \to \sqrt{M}$
and then multiplied overall by 2) to use to calculate $D_1 + D_2$ in the case of joint
encoding a 2-torus, and which of equation 27 or equation 30 to use to calculate
$D_1 + D_2$ in the case of factorial encoding a 2-torus. These results are gathered
together in figure 10.

The need to derive results where up to 3 posterior probabilities overlap
(which involves a large amount of algebra) is clear from the results shown in
figure 10, where it may be seen that most of the region where the factorial
encoder is favoured with respect to the joint encoder has up to 3 overlapping
posterior probabilities. The degree to which a factorial encoder is favoured with
respect to a joint encoder may be seen in figure 11.

If the number of neurons M is restricted (i.e. M < 12), then the joint 
encoding scheme in which the 2-torus is encoded using small encoding cells as 
shown in figure E^a), is usually not as good as the factorial encoding scheme in 
Figure 9: Contour plot of the optimum value of $s$ versus $(n, M)$ for factorial
encoding of a toroidal manifold. The solid contours are for the interval
$0 \le s \le 2\pi/M$, the dotted contours are for $2\pi/M \le s \le 4\pi/M$, and the dashed contour is
for $s = 2\pi/M$ (this behaves asymptotically as $n \approx \frac{3}{2}M^2/\pi^2$). The contours are all
separated by a fixed interval in $s$.



which the 2-torus is encoded using the intersection of pairs of elongated encoding 
cells as shown in figure Gib). This does require that the number of firing events 
n is sufficiently large that both subsets of $M/2$ neurons in the factorial encoder
are virtually guaranteed to each receive at least 1 firing event, so that they can 
indeed approximate the input vector by the intersection of a pair of response 
regions. 

If the number of neurons M is too large (i.e. M > 12), then the joint en- 
coding scheme is always favoured with respect to the factorial encoding scheme, 
because there are sufficient neurons to encode the 2-torus well using small re- 
sponse regions, as shown in figure|2Ia). This includes the limiting case M — > oo, 
where the curvature of the input manifold is not visible to each neuron sepa- 
rately, because each neuron then responds to an infinitesimally small angular 
interval of the input manifold. This result implies that joint encoding is always 
favoured when the input manifold is planar, as was discussed in figure and 
figure 

Although not presented here, these results generalise readily to higher dimen- 
sional toruses, where factorial encoding is even more favoured, because (roughly 
Figure 10: The diagram shows various results pertaining to joint and factorial
encoding of a 2-torus. The solid line is the boundary between the regions of the
$(n, M)$ plane where joint or factorial encoding are favoured, and the horizontal
dashed line is the asymptotic limit $M \approx 12$ of this boundary as $n \to \infty$. The left
hand dashed line is the boundary between the regions where 2 or 3 overlapping
posterior probabilities occur in joint encoding, and the right hand dashed line is
the corresponding boundary for factorial encoding.



speaking) the number of neurons required to do joint encoding with a given res- 
olution increases exponentially with the dimensionality of the input, whereas 
the number of neurons required to do factorial encoding with a given resolution 
increases linearly with the dimensionality of the input (provided that enough 
firing events are observed). 
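The rough scaling argument can be made concrete with a toy count. Here `k` (the number of encoding cells wanted along each circular dimension) and `d` (the number of circular dimensions) are hypothetical parameters introduced only for this illustration.

```python
def neurons_joint(k, d):
    """Joint encoding tiles the whole torus: k cells per dimension gives k**d neurons."""
    return k ** d

def neurons_factorial(k, d):
    """Factorial encoding uses one sub-network of k neurons per circular dimension."""
    return k * d

# Neuron counts for a resolution of 6 cells per dimension on tori of increasing dimension.
table = [(d, neurons_joint(6, d), neurons_factorial(6, d)) for d in (1, 2, 3, 4)]
```

For the 2-torus this gives 36 neurons for joint encoding against 12 for factorial encoding at the same per-dimension resolution, and the gap widens rapidly with `d`.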

6 Asymptotic Results 

Referring to figure 10, the asymptotic behaviour as $M \to \infty$ lies in the region
where two posterior probabilities overlap, and the asymptotic behaviour as
$n \to \infty$ lies in the region where three posterior probabilities overlap, so care must
be taken to use the appropriate results when deriving the various asymptotic
approximations below. The boundary between the regions where two or three
posterior probabilities overlap can be obtained for a circular input manifold by
putting $s = \pi/M$ in equation 18 (or $s = 2\pi/M$ in equation 26 in the case of a toroidal
input manifold), and as $M \to \infty$ this is given by

$$n \approx \begin{cases} \dfrac{3M^2}{\pi^2} & \text{circular manifold} \\ \dfrac{3}{2}\dfrac{M^2}{\pi^2} & \text{toroidal manifold (factorial encoding)} \end{cases} \tag{31}$$

Figure 11: Plots for $M = 6, 7, 8, 9, 10, 11$ of $(D_1 + D_2)_{\text{factorial}} - (D_1 + D_2)_{\text{joint}}$
in units in which $(D_1 + D_2)_{\text{factorial}} = 1$. This makes it clear that the degree
to which a factorial encoder is favoured with respect to a joint encoder is quite
significant for large $n$.




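As $M \to \infty$ the asymptotic behaviour of $D_1 + D_2$ for a circular input manifold
may be obtained by asymptotically expanding the $s$ dependence of equation 18
(or equation 26 in the case of a toroidal input manifold) in inverse powers of
$M$, to yield

$$s \approx \begin{cases} \dfrac{n-1}{n}\dfrac{\pi}{M} + \dfrac{(n-1)(n^2-4n+2)}{3n^3}\dfrac{\pi^3}{M^3} & \text{circular manifold} \\ \dfrac{n-1}{n+1}\dfrac{2\pi}{M} + \dfrac{(n-1)(n^2-6n+1)}{3(n+1)^3}\left(\dfrac{2\pi}{M}\right)^3 & \text{toroidal manifold (factorial encoding)} \end{cases} \tag{32}$$

and substituting this solution into the appropriate expression for $r$ to obtain

$$r \approx \begin{cases} 1 + \dfrac{2n^2-6n+3}{6n^2}\dfrac{\pi^2}{M^2} & \text{circular manifold} \\ \dfrac{2n}{n+1} + \dfrac{8n(n^2-4n+1)}{3(n+1)^3}\dfrac{\pi^2}{M^2} & \text{toroidal manifold (factorial encoding)} \end{cases} \tag{33}$$

and substituting this solution into the appropriate expression for $D_1 + D_2$ to
obtain

$$D_1 + D_2 \approx \begin{cases} \dfrac{2(2n-1)}{3n^2}\dfrac{\pi^2}{M^2} & \text{circular manifold} \\ \dfrac{4}{n+1} + \dfrac{64n^2}{3(n+1)^3}\dfrac{\pi^2}{M^2} & \text{toroidal manifold (factorial encoding)} \end{cases} \tag{34}$$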



The asymptotic result for a circular manifold may be used to determine the
corresponding result for a linear manifold. Thus, if lengths are scaled so that the
separation of the neurons (as measured around the circular manifold) becomes
unity, which requires that all lengths are divided by $2\pi/M$, then asymptotically
as $M \to \infty$ the circular manifold solution becomes identical to the solution for
a linear manifold with neurons separated by unit distance. Thus the optimum
solution for a linear manifold with neurons separated by unit distance is
$s = \frac{n-1}{2n}$ and $D_1 + D_2 = \frac{2n-1}{6n^2}$ (note that $D_1 + D_2$ has the dimensions of (length)$^2$).

As $n \to 1$ (i.e. the LBG vector quantiser limit) the asymptotic behaviour
of $D_1 + D_2$ for a circular input manifold may be obtained by expanding the
$s$ dependence of equation 18 about the point $s = 0$ (or equation 26 about the
point $s = 0$ for a toroidal input manifold), to yield

$$s \approx \begin{cases} (n-1)\dfrac{M}{\pi}\sin^2\left(\dfrac{\pi}{M}\right) & \text{circular manifold} \\ (n-1)\dfrac{M}{4\pi}\sin^2\left(\dfrac{2\pi}{M}\right) & \text{toroidal manifold (factorial encoding)} \end{cases} \tag{35}$$

which gives $s = 0$ when $n = 1$, so there is no overlap between the posterior
probabilities for different neurons, as would be expected in a vector quantiser where
only one neuron is allowed to fire. Substitute this solution into the appropriate
expression for $r$ to obtain at $n = 1$

$$r = \begin{cases} \dfrac{M}{\pi}\sin\left(\dfrac{\pi}{M}\right) & \text{circular manifold} \\ \dfrac{M}{2\pi}\sin\left(\dfrac{2\pi}{M}\right) & \text{toroidal manifold (factorial encoding)} \end{cases} \tag{36}$$

which is the distance of the centroid of an arc of the unit circle (with angular
length $2\pi/M$ for a circular manifold, or angular length $4\pi/M$ for a toroidal manifold)
from the origin, as expected for a network in which only one neuron can fire.
So the best reconstruction is the centroid of the inputs that could have caused
the single firing event. These results may be substituted into the appropriate
expression for $D_1 + D_2$ to obtain at $n = 1$

$$D_1 + D_2 = \begin{cases} 2 - \dfrac{2M^2}{\pi^2}\sin^2\left(\dfrac{\pi}{M}\right) & \text{circular manifold} \\ 4 - \dfrac{M^2}{2\pi^2}\sin^2\left(\dfrac{2\pi}{M}\right) & \text{toroidal manifold (factorial encoding)} \end{cases} \tag{37}$$

These results for $D_1 + D_2$ have a simple geometrical interpretation. For a
circular manifold $D_1 + D_2$ is (twice) the average squared distance from an arc
with angular length $2\pi/M$ to its associated reference vector, which is exactly what
would be expected. For a toroidal manifold $D_1 + D_2$ is the same result with
$M \to M/2$, plus an extra contribution of 2, because a factorial encoder with only
1 firing event acts as a conventional encoder using $M/2$ neurons for the circular
dimension that is fortunate enough to be associated with the firing event (hence
the first contribution to $D_1 + D_2$), and acts as no encoder at all for the other
circular dimension which is associated with no firing events (hence the extra
contribution of 2 to $D_1 + D_2$).
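This geometrical interpretation can be checked numerically. The sketch below averages points of a unit-circle arc with a midpoint rule to find the centroid radius, then forms twice the mean squared distance to that centroid; the function name and the sample count are illustrative choices.

```python
import math

def arc_centroid_radius(angle, samples=20000):
    """Distance from the origin to the centroid of a unit-circle arc of the given angular length."""
    # The arc is centred on angle 0, so the centroid lies on the x axis.
    xs = [math.cos(-angle / 2 + angle * (i + 0.5) / samples) for i in range(samples)]
    return sum(xs) / samples

M = 8
r_circle = arc_centroid_radius(2 * math.pi / M)   # one neuron per arc of angle 2*pi/M
d_circle = 2.0 * (1.0 - r_circle ** 2)            # twice the mean squared distance

r_torus = arc_centroid_radius(4 * math.pi / M)    # M/2 neurons per circle
d_torus = 2.0 * (1.0 - r_torus ** 2) + 2.0        # plus 2 for the unencoded circle
```

The mean squared distance from points on the arc to their centroid is 1 − r², which is why `d_circle` takes the form 2(1 − r²).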

As $n \to \infty$ the asymptotic behaviour of $D_1 + D_2$ for a circular input manifold
may be obtained by expanding the $s$ dependence of equation 23 about the point
$s = 2\pi/M$ (or equation 29 about the point $s = 4\pi/M$ for a toroidal input manifold),
to yield



27T _ / 3tt 
, M [ M n cos 2 ( f. , . 

^ Uf// i (38) 



4-rr _ 

M I Af„cos 2 (^) 



circular manifold 
toroidal manifold (factorial encoding) 



where the limiting values of $s$ as $n \to \infty$ (i.e. $s \to 2\pi/M$ for a circular manifold,
and $s \to 4\pi/M$ for a toroidal manifold) stop just short of allowing four or more
posterior probabilities to overlap. In this limit $D_1 \to 0$, so for a circular manifold
the network acts as a PCA encoder (see the discussion after equation 51) whose
expansion coefficients sum to unity. In order to encode vectors on a unit circle
without error three basis vectors are required; the expansion coefficients are
probabilities which must sum to unity, so three basis vectors are required in order
that there are two independent expansion coefficients. This is the reason why it
is sufficient to consider no more than three overlapping posterior probabilities for
encoding data that lives in a 2-dimensional manifold (this argument generalises
straightforwardly to higher dimensions). The same argument applies to the case
of factorial encoding of a toroidal manifold. Substitute this solution into the
appropriate expression for $r$ to obtain



±sec(£)(2-(- 



\ 2/3 \ 

3 \ , = ; ] I circular manifold 



sec(^j) ( 2 — ( 12 2(hL\ ) ) toroidal manifold (factorial encoding) 



Mncos 2 (j) 

, 2 /3 

■Tt) l Z _ ^Mncos 2 (f ) 

(39) 

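and substitute these results into the appropriate expression for $D_1 + D_2$ to
obtain

$$D_1 + D_2 \approx \begin{cases} \dfrac{2}{n}\tan^2\left(\dfrac{\pi}{M}\right) & \text{circular manifold} \\ \dfrac{4}{n}\left(2\sec^2\left(\dfrac{2\pi}{M}\right) - 1\right) & \text{toroidal manifold (factorial encoding)} \end{cases} \tag{40}$$

Thus as $n \to \infty$ it is possible to derive a value of $M$ for which the asymptotic
$D_1 + D_2$ is the same for joint and factorial encoding of a toroidal manifold.
This value of $M$ must satisfy $\frac{4}{n}\tan^2\left(\frac{\pi}{\sqrt{M}}\right) = \frac{4}{n}\left(2\sec^2\left(\frac{2\pi}{M}\right) - 1\right)$, which yields
$M \approx 11.74$.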
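The crossover value can be recovered numerically. The sketch assumes that the common prefactor cancels, so that the condition reduces to tan²(π/√M) = 2 sec²(2π/M) − 1, and treats M as a continuous variable; the bracketing interval [8, 16] and the function name `gap` are choices made for this example.

```python
import math

def gap(M):
    """Joint minus factorial asymptotic cost, with the common prefactor removed."""
    joint = math.tan(math.pi / math.sqrt(M)) ** 2
    factorial = 2.0 / math.cos(2 * math.pi / M) ** 2 - 1.0
    return joint - factorial

lo, hi = 8.0, 16.0            # gap(8) > 0 and gap(16) < 0 bracket the crossing
while hi - lo > 1e-10:
    mid = 0.5 * (lo + hi)
    if gap(mid) > 0:
        lo = mid              # joint still more costly: crossing lies at larger M
    else:
        hi = mid
m_star = 0.5 * (lo + hi)      # should land close to the value 11.74 quoted in the text
```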



7 Approximate the Posterior Probability 

A posterior probability may always be written in the form

$$\Pr(y|\mathbf{x}) = \frac{Q(\mathbf{x}|y)}{\sum_{y'=0}^{M-1} Q(\mathbf{x}|y')} \tag{41}$$

where $Q(\mathbf{x}|y) \ge 0$ (with $Q(\mathbf{x}|y) > 0$ for at least one value of $y$ for each $\mathbf{x}$). If
the neurons behaved in such a way that they produced independent Poissonian
firing events in response to a given input, then $Q(\mathbf{x}|y)$ would be the firing rate
(or activation function) of neuron $y$ in response to input $\mathbf{x}$.
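In code, this normalisation is a one-line operation on a vector of non-negative firing rates. The function name `posterior_from_rates` is hypothetical, and the rates in the example are made up.

```python
def posterior_from_rates(rates):
    """Normalise non-negative firing rates Q(x|y) into posterior probabilities Pr(y|x)."""
    if any(q < 0 for q in rates):
        raise ValueError("firing rates must be non-negative")
    total = sum(rates)
    if total <= 0:
        raise ValueError("at least one neuron must have a positive rate")
    return [q / total for q in rates]

probabilities = posterior_from_rates([2.0, 1.0, 1.0, 0.0])
```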

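The optimum solution $p(\theta)$ (as given in equation 14 and equation 15) may
be approximated on the unit circle (i.e. $\mathbf{x} = (\cos\theta, \sin\theta)$) by defining $Q(\mathbf{x}|y)$ as

$$Q(\mathbf{x}|y) = \begin{cases} \mathbf{w}(y)\cdot\mathbf{x} - a & \mathbf{w}(y)\cdot\mathbf{x} \ge a \\ 0 & \mathbf{w}(y)\cdot\mathbf{x} \le a \end{cases} \qquad \mathbf{w}(y) = \left(\cos\frac{2\pi y}{M},\ \sin\frac{2\pi y}{M}\right) \qquad a = \cos\left(\frac{\pi}{M}\right) - \sin\left(\frac{\pi}{M}\right)\sin s \tag{42}$$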



where a is a threshold parameter, and w is a unit weight vector. This is the 
form of the neural activation function that is used in ^j] . This leads to a good 
approximation to the optimum solution p (9) because 







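$$p(\theta) \approx \begin{cases} \dfrac{Q(\mathbf{x}|y=0)}{Q(\mathbf{x}|y=0) + Q(\mathbf{x}|y=M-1)} & -\frac{\pi}{M} - s \le \theta \le -\frac{\pi}{M} + s \\ 1 & -\frac{\pi}{M} + s \le \theta \le \frac{\pi}{M} - s \\ \dfrac{Q(\mathbf{x}|y=0)}{Q(\mathbf{x}|y=0) + Q(\mathbf{x}|y=1)} & \frac{\pi}{M} - s \le \theta \le \frac{\pi}{M} + s \end{cases} \tag{43}$$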

This approximation works well because curved input manifolds can be optimally
encoded by using appropriate hyperplanes (as defined in equation 42) to slice
off pieces of the manifold.
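The quality of the hyperplane approximation can be explored with a short comparison. This is a sketch under assumptions: the activation is taken to be thresholded-linear, Q = max(w·x − a, 0), with threshold a = cos(π/M) − sin(π/M) sin s, which is our reconstruction of equation 42, and the exact overlap piece is taken to be 1/2 + sin(π/M − |θ|) / (2 sin s). The values M = 8 and s = 0.49 π/M are those quoted for figure 5.

```python
import math

def q(theta, y, M, s):
    """Assumed thresholded-linear activation of neuron y for the input at angle theta."""
    a = math.cos(math.pi / M) - math.sin(math.pi / M) * math.sin(s)
    return max(math.cos(theta - 2 * math.pi * y / M) - a, 0.0)

def p_approx(theta, M, s):
    """Posterior of neuron 0 obtained by normalising the activations."""
    rates = [q(theta, y, M, s) for y in range(M)]
    return rates[0] / sum(rates)

def p_exact(theta, M, s):
    """Two-overlap optimum posterior of neuron 0 (assumed form)."""
    t, half = abs(theta), math.pi / M
    if t <= half - s:
        return 1.0
    if t >= half + s:
        return 0.0
    return 0.5 + 0.5 * math.sin(half - t) / math.sin(s)

M = 8
s = 0.49 * math.pi / M
grid = [-2 * math.pi / M + 4 * math.pi / M * k / 2000 for k in range(2001)]
max_error = max(abs(p_approx(t, M, s) - p_exact(t, M, s)) for t in grid)
```

Under these assumptions the two curves agree exactly at the midpoint of the overlap interval and differ most near its ends, where the thresholded activation switches off slightly early.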

This approximation breaks down as $M \to \infty$, as can be seen by inspecting
the series expansion of $p(\theta)$ near $\theta = \pi/M$

$$p(\theta) \approx \begin{cases} \dfrac{1}{2} - \dfrac{1}{2}\dfrac{1}{\sin s}\left(\theta - \dfrac{\pi}{M}\right) + \dfrac{1}{12}\dfrac{1}{\sin s}\left(\theta - \dfrac{\pi}{M}\right)^3 + O\left(\left(\theta - \dfrac{\pi}{M}\right)^5\right) & \text{exact} \\ \dfrac{1}{2} - \dfrac{1}{2}\dfrac{1}{\sin s}\left(\theta - \dfrac{\pi}{M}\right) + \dfrac{1}{12}\left(\dfrac{1}{\sin s} - \dfrac{3}{\tan\left(\frac{\pi}{M}\right)\sin^2 s}\right)\left(\theta - \dfrac{\pi}{M}\right)^3 + O\left(\left(\theta - \dfrac{\pi}{M}\right)^5\right) & \text{approximate} \end{cases} \tag{44}$$

which differ in the $O\left((\theta - \pi/M)^3\right)$ term. In the limit $M \to \infty$ the half-width
parameter $s$ behaves like $1/M$, so the $O\left((\theta - \pi/M)^3\right)$ term behaves like
$M(\theta - \pi/M)^3$ in the exact case, and $M^3(\theta - \pi/M)^3$ in the approximate case
because of the contribution from the $\frac{1}{\tan(\pi/M)\sin^2 s}$ term. As $M \to \infty$ each neuron
responds to a progressively smaller angular range of inputs on the unit circle,
so from the point of view of each neuron the curvature of the input manifold
becomes negligible (i.e. the input manifold appears to more and more closely
approximate a straight line), which ultimately makes it impossible to use
hyperplanes to slice off pieces of the manifold. In the $M \to \infty$ limit, a better
approximation to the posterior probability would be to use ball-shaped regions
(e.g. a radial basis function network) to cut up the input manifold into pieces.

8 Conclusions 

The results in this paper demonstrate that, for input data that lies on a curved 
manifold (specifically, a 2-torus), and for an objective function that measures the 
average reconstruction error (in the Euclidean sense) of a 2-layer neural network 
encoder, the type of encoder that is optimal depends on the total number of 
neurons and on the total number of observed firing events in the network output 
layer. There are two basic types of encoder: a joint encoder in which the network 
acts as a vector quantiser for the whole input space, and a factorial encoder in 
which the network breaks into a number of subnetworks, each of which acts as 
a vector quantiser for a subspace of the input space. 

The particular conditions under which factorial encoding is favoured with 
respect to joint encoding arise when the input data is derived from a curved input 
manifold, provided that the number of neurons is not too large, and provided 
that the number of observed neural firing events is large enough. Factorial 
encoding does not emerge when the input manifold is insufficiently curved, or 
equivalently when there are too many neurons, because then each neuron does 
not have a sufficiently large encoding cell to be aware of the manifold's curvature. 

Factorial encoding allows the input data to be encoded using a much smaller 
number of neurons than would be the case if joint encoding were used. Because 
only a small number of neurons is used, a factorial encoding scheme must be 
succinct, so it has to abstract the underlying degrees of freedom in the input 
manifold; this is a very useful side-effect of factorial encoding. This effect be- 
comes stronger as the dimensionality of the curved input manifold is increased. 

The main simplification that makes these calculations possible is that, in 
an optimal neural network, the form for the posterior probability is a piecewise 
linear function of the input vector. This leads to an enormous simplification in 
the mathematics, because only the space of piecewise linear functions needs to 
be searched for the optimal solution, rather than the whole space of functions 
(subject to normalisation and non-negativity constraints). 

A convenient approximation to this type of factorial encoder is the partitioned
mixture distribution (PMD) network, in which the individual subnetworks
in the factorial encoder network are constrained to share parameters,
which thus leads to an upper bound on the minimum value of the objective
function that would have ideally been obtained with the unconstrained factorial
encoder network.

A Objective Function

The objective function $D = 2D_{VQ}$ is given by

$$D = 2\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y}\Pr(y|\mathbf{x})\,\|\mathbf{x} - \mathbf{x}'(y)\|^2 \tag{45}$$

If the observed state of the output layer is the locations of $n$ firing events on $M$
neurons, then this expression for $D$ can be manipulated into the following form

$$D = 2\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y_1=1}^{M}\sum_{y_2=1}^{M}\cdots\sum_{y_n=1}^{M}\Pr(y_1, y_2, \cdots, y_n|\mathbf{x})\,\|\mathbf{x} - \mathbf{x}'(y_1, y_2, \cdots, y_n)\|^2 \tag{46}$$

where $\Pr(y|\mathbf{x})$ has now been replaced by the more explicit notation
$\Pr(y_1, y_2, \cdots, y_n|\mathbf{x})$, and $\mathbf{x}'(y_1, y_2, \cdots, y_n)$ is a vector given by

$$\mathbf{x}'(y_1, y_2, \cdots, y_n) = \int d\mathbf{x}\,\Pr(\mathbf{x}|y_1, y_2, \cdots, y_n)\,\mathbf{x} \tag{47}$$

where $\Pr(\mathbf{x}|y_1, y_2, \cdots, y_n)$ may be expressed in terms of $\Pr(\mathbf{x})$ and
$\Pr(y_1, y_2, \cdots, y_n|\mathbf{x})$ by using Bayes' theorem. The goal now is
to minimise the expression for $D$ in equation 46 with respect to the function
$\Pr(y_1, y_2, \cdots, y_n|\mathbf{x})$. The correct value for $\mathbf{x}'(y_1, y_2, \cdots, y_n)$ may be determined
by treating it as an unknown parameter that has to be adjusted to minimise $D$.

$\Pr(y_1, y_2, \cdots, y_n|\mathbf{x})$ may be interpreted as a recognition model which
transforms the state of the input layer into (a probabilistic description of) the state
of the output layer, and $\mathbf{x}'(y_1, y_2, \cdots, y_n)$ may be regarded as the corresponding
generative model that transforms the state of the output layer into (an
approximate reconstruction of) the state of the input layer.

There is so much flexibility in the choice of $\Pr(y_1, y_2, \cdots, y_n|\mathbf{x})$ (and the
corresponding $\mathbf{x}'(y_1, y_2, \cdots, y_n)$) that even if $D$ is minimised, it does not
necessarily yield an encoded version of the input that is easily interpretable. One
way in which a code can be encouraged to have a simple interpretation is to 
force x' (2/1, 2/2, •• • , 2M) (i-e. the generative model) to be parameterised thus [1J 

x' (2/1,2/2,- •• ,2/«) =x'(2/i)+x'(2/ 2 ) + --- + x'(2/„) (48) 

which is a (symmetric) superposition of reference vectors $\mathbf{x}'(y)$ from each neuron
$y$ that has been observed to fire. In this case each neuron has a clearly
identifiable contribution to the reconstruction of the input, which makes it much
easier to interpret what each neuron is doing. In this case the $\|\cdots\|^2$ term
in $D$ is symmetric under interchange of the $(y_1, y_2, \cdots, y_n)$, so only the
symmetric part $S[\Pr(y_1, y_2, \cdots, y_n|\mathbf{x})]$ of $\Pr(y_1, y_2, \cdots, y_n|\mathbf{x})$ under interchange
of the $(y_1, y_2, \cdots, y_n)$ contributes to $D$, because the symmetric summation
$\sum_{y_1=1}^{M}\sum_{y_2=1}^{M}\cdots\sum_{y_n=1}^{M}(\cdots)$ then removes all non-symmetric contributions.

Define the marginal probabilities $\Pr(y_1|\mathbf{x})$ and $\Pr(y_1, y_2|\mathbf{x})$ of the symmetric
part $S[\Pr(y_1, y_2, \cdots, y_n|\mathbf{x})]$ of $\Pr(y_1, y_2, \cdots, y_n|\mathbf{x})$ under interchange of the
$(y_1, y_2, \cdots, y_n)$ as

$$\Pr(y_1|\mathbf{x}) = \sum_{y_2, y_3, \cdots, y_n=1}^{M} S[\Pr(y_1, y_2, \cdots, y_n|\mathbf{x})] \qquad \Pr(y_1, y_2|\mathbf{x}) = \sum_{y_3, y_4, \cdots, y_n=1}^{M} S[\Pr(y_1, y_2, \cdots, y_n|\mathbf{x})] \tag{49}$$

These marginal probabilities are for the case where $n$ firing events have
potentially been observed, but only the locations of 1 (or 2) firing event(s) chosen
randomly from the total number $n$ have actually been observed, with the
locations of the other $n - 1$ (or $n - 2$) firing events having been averaged over.
If it is assumed that $\Pr(y_1|\mathbf{x})$ and $\Pr(y_1, y_2|\mathbf{x})$ are related by

$$\Pr(y_1, y_2|\mathbf{x}) = \Pr(y_1|\mathbf{x})\,\Pr(y_2|\mathbf{x}) \tag{50}$$

then the objective function $D$ has an upper bound $D_1 + D_2$ given by

$$D \le D_1 + D_2 \qquad D_1 = \frac{2}{n}\int d\mathbf{x}\,\Pr(\mathbf{x})\sum_{y=1}^{M}\Pr(y|\mathbf{x})\,\|\mathbf{x} - \mathbf{x}'(y)\|^2 \qquad D_2 = \frac{2(n-1)}{n}\int d\mathbf{x}\,\Pr(\mathbf{x})\,\Big\|\mathbf{x} - \sum_{y=1}^{M}\Pr(y|\mathbf{x})\,\mathbf{x}'(y)\Big\|^2 \tag{51}$$



Each of the two marginal probabilities in equation 49 contributes to a different
term in $D_1 + D_2$; $\Pr(y_1|\mathbf{x})$ contributes to $D_1$, whereas $\Pr(y_1, y_2|\mathbf{x})$ contributes
to $D_2$. Informally speaking, $D_1$ measures the information that a single firing
event (out of $n$ such events) contributes to the reconstruction of the input,
whereas $D_2$ measures the information that pairs of firing events (out of $n$ such
events) contribute to the reconstruction of the input. $D_1$ is weighted by a factor
$\frac{1}{n}$ which suppresses the single firing event contribution as $n \to \infty$, whereas $D_2$ is
weighted by a factor $\frac{n-1}{n}$ which suppresses the double firing event contribution
as $n \to 1$, as expected. If only the $D_1$ part of the objective function is used (i.e.
$n = 1$), then a standard LBG vector quantiser emerges which approximates
the input by a single reference vector $\mathbf{x}'(y)$, whereas if only the $D_2$ part of the
objective function is used (i.e. $n \to \infty$), then the network behaves essentially
as a principal component analyser (PCA) which approximates the input by a
sum of reference vectors $\sum_{y=1}^{M}\Pr(y|\mathbf{x})\,\mathbf{x}'(y)$, where the $\Pr(y|\mathbf{x})$ are expansion
coefficients which sum to unity, and the $\mathbf{x}'(y)$ are basis vectors.
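The two terms can be estimated from a finite sample of inputs, with the integral over Pr(x) replaced by a sample average. The sketch assumes the weights 2/n for the single-event term and 2(n−1)/n for the pair term, as described above; the function name `d1_d2` and the toy data are hypothetical.

```python
def d1_d2(samples, posteriors, refs, n):
    """Sample estimates of (D1, D2).

    samples[i] is an input vector, posteriors[i][y] is Pr(y|x_i),
    and refs[y] is the reference vector x'(y).
    """
    d1 = d2 = 0.0
    for x, p in zip(samples, posteriors):
        # Single-event term: expected squared distance to the firing neuron's reference vector.
        d1 += sum(py * sum((xi - ri) ** 2 for xi, ri in zip(x, r))
                  for py, r in zip(p, refs))
        # Pair term: squared distance to the posterior-weighted mean reference vector.
        mean = [sum(py * r[k] for py, r in zip(p, refs)) for k in range(len(x))]
        d2 += sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    count = len(samples)
    return (2.0 / n) * d1 / count, (2.0 * (n - 1) / n) * d2 / count

samples = [(1.0, 0.0), (0.0, 1.0)]
posteriors = [[1.0, 0.0], [0.0, 1.0]]      # hard (winner-take-all) assignments
refs = [(1.0, 0.0), (0.0, 0.0)]
lbg_limit = d1_d2(samples, posteriors, refs, 1)    # only D1 survives at n = 1
two_events = d1_d2(samples, posteriors, refs, 2)
```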

The upper bound $D_1 + D_2$ on $D$ contains LBG encoding and PCA encoding
as two limiting cases, and gives a principled way of interpolating between these
extremes. This useful property has been bought at the cost of replacing $D$ by
an upper bound $D_1 + D_2$, which will yield only a suboptimal (from the
point of view of $D$) encoder. However, this upper bound can be expected to
be tight in cases where the input manifold can be modelled accurately using
the parametric form $\mathbf{x}'(y_1) + \mathbf{x}'(y_2) + \cdots + \mathbf{x}'(y_n)$. These conditions are well
approximated in images which consist of a discrete number of constituents, each
of which may be represented by an $\mathbf{x}'(y)$ for some choice of $y$. This model fails
in situations where two or more constituents are placed so that they overlap, in
which case the image will typically contain occluded objects, whereas the model
assumes that the objects linearly superpose. Occlusion is not an easy situation
to model, so it will be assumed that the image constituents are sufficiently sparse
that they rarely occlude each other.



B Stationarity Conditions 

The expression for $D_1 + D_2$ has two types of parameters that
need to be optimised: the reference vectors $\mathbf{x}'(y)$ and the posterior probabilities
$\Pr(y|\mathbf{x})$. In appendix B.1 the stationarity condition for $\mathbf{x}'(y)$ is derived, and
in appendix B.2 the stationarity condition for $\Pr(y|\mathbf{x})$ is derived, taking into
account the constraints $0 \le \Pr(y|\mathbf{x}) \le 1$ and $\sum_{y=1}^{M} \Pr(y|\mathbf{x}) = 1$ which must be
satisfied by probabilities.



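B.1 Stationary $\mathbf{x}'(y)$

The stationarity condition $\frac{\partial (D_1 + D_2)}{\partial \mathbf{x}'(y)} = 0$ for $\mathbf{x}'(y)$ was derived in earlier work. Thus
$\frac{\partial (D_1 + D_2)}{\partial \mathbf{x}'(y)}$ can be written as

$$\frac{\partial (D_1 + D_2)}{\partial \mathbf{x}'(y)} = -\frac{4}{n} \int d\mathbf{x}\, \Pr(\mathbf{x}) \Pr(y|\mathbf{x}) \left( n\,\mathbf{x} - \mathbf{x}'(y) - (n-1) \sum_{y'=1}^{M} \Pr(y'|\mathbf{x})\, \mathbf{x}'(y') \right) \tag{52}$$

and, using Bayes' theorem in the form $\Pr(\mathbf{x}|y) \Pr(y) = \Pr(y|\mathbf{x}) \Pr(\mathbf{x})$, this
yields a matrix equation for the $\mathbf{x}'(y)$

$$0 = \Pr(y) \left( n \int d\mathbf{x}\, \Pr(\mathbf{x}|y)\, \mathbf{x} - (n-1) \sum_{y'=1}^{M} \left( \int d\mathbf{x}\, \Pr(\mathbf{x}|y) \Pr(y'|\mathbf{x}) \right) \mathbf{x}'(y') - \mathbf{x}'(y) \right) \tag{53}$$

There are two classes of solution to this stationarity condition, corresponding
to one (or more) of the two factors in equation 53 being zero.

1. $\Pr(y) = 0$ (the first factor is zero). If the probability that neuron $y$ fires
is zero, then nothing can be deduced about $\mathbf{x}'(y)$, because there is no
training data to explore this neuron's behaviour.
2. $n \int d\mathbf{x}\, \Pr(\mathbf{x}|y)\, \mathbf{x} = (n-1) \sum_{y'=1}^{M} \left( \int d\mathbf{x}\, \Pr(\mathbf{x}|y) \Pr(y'|\mathbf{x}) \right) \mathbf{x}'(y') + \mathbf{x}'(y)$
(the second factor is zero). The solution to this matrix equation is the
required $\mathbf{x}'(y)$.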
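The matrix equation in the second class of solution is linear in the reference vectors, so for fixed posterior probabilities it can be solved directly. The following is a minimal sketch under stated assumptions: the integral over $\mathbf{x}$ is replaced by a sum over equally weighted samples, every neuron has non-zero firing probability, and the names `solve_linear`, `stationary_refs` and `gradient` are illustrative rather than taken from the original.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination (a square, b a matrix)."""
    m = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda row: abs(aug[row][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for row in range(m):
            if row != col:
                f = aug[row][col]
                aug[row] = [vr - f * vc for vr, vc in zip(aug[row], aug[col])]
    return [row[m:] for row in aug]


def stationary_refs(xs, post, n):
    """Reference vectors x'(y) that satisfy the second-factor condition.

    xs[i] is a sample, post[i][y] = Pr(y | xs[i]), n = number of firing events.
    """
    num_y, dim, num_x = len(post[0]), len(xs[0]), len(xs)
    tot = [sum(p[y] for p in post) for y in range(num_y)]  # proportional to Pr(y)
    # Pr(x_i | y) from Bayes' theorem with equally weighted samples.
    back = [[p[y] / tot[y] for y in range(num_y)] for p in post]
    rhs = [[n * sum(back[i][y] * xs[i][k] for i in range(num_x))
            for k in range(dim)] for y in range(num_y)]
    lhs = [[(1.0 if y == y2 else 0.0)
            + (n - 1) * sum(back[i][y] * post[i][y2] for i in range(num_x))
            for y2 in range(num_y)] for y in range(num_y)]
    return solve_linear(lhs, rhs)


def gradient(xs, post, refs, n):
    """Derivative of D1 + D2 w.r.t. each x'(y), up to a positive constant."""
    grads = []
    for y in range(len(refs)):
        g = [0.0] * len(xs[0])
        for x, p in zip(xs, post):
            recon = [sum(p[y2] * refs[y2][k] for y2 in range(len(refs)))
                     for k in range(len(x))]
            for k in range(len(x)):
                g[k] -= p[y] * (n * x[k] - refs[y][k] - (n - 1) * recon[k])
        grads.append(g)
    return grads
```

For `n = 1` the linear system reduces to the identity, and each reference vector becomes the posterior-weighted centroid of the samples, which is the familiar vector quantiser update.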

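B.2 Stationary $\Pr(y|\mathbf{x})$

The stationarity condition $\frac{\delta (D_1 + D_2)}{\delta \log \Pr(y|\mathbf{x})} = 0$ (with the normalisation constraint
$\sum_{y'=1}^{M} \Pr(y'|\mathbf{x}) = 1$) for $\Pr(y|\mathbf{x})$ will now be derived. Thus functionally
differentiate $D_1 + D_2$ with respect to $\log \Pr(y|\mathbf{x})$, where logarithmic differentiation
implicitly imposes the constraint $\Pr(y|\mathbf{x}) \ge 0$, and use a Lagrange multiplier
term $L \equiv \int d\mathbf{x}'\, \lambda(\mathbf{x}') \sum_{y'=1}^{M} \Pr(y'|\mathbf{x}')$ to impose the normalisation constraint
$\sum_{y=1}^{M} \Pr(y|\mathbf{x}) = 1$ for each $\mathbf{x}$, to obtain

$$\frac{\delta (D_1 + D_2 - L)}{\delta \log \Pr(y|\mathbf{x})} = \frac{2}{n} \Pr(\mathbf{x}) \Pr(y|\mathbf{x}) \left\| \mathbf{x} - \mathbf{x}'(y) \right\|^2 - \frac{4(n-1)}{n} \Pr(\mathbf{x}) \Pr(y|\mathbf{x})\, \mathbf{x}'(y) \cdot \left( \mathbf{x} - \sum_{y'=1}^{M} \Pr(y'|\mathbf{x})\, \mathbf{x}'(y') \right) - \lambda(\mathbf{x}) \Pr(y|\mathbf{x}) \tag{54}$$

The stationarity condition implies that $\sum_{y=1}^{M} \Pr(y|\mathbf{x}) \frac{\delta (D_1 + D_2 - L)}{\delta \Pr(y|\mathbf{x})} = 0$, which
may be used to determine the Lagrange multiplier function $\lambda(\mathbf{x})$. When $\lambda(\mathbf{x})$
is substituted back into the stationarity condition itself, it yields

$$0 = \Pr(\mathbf{x}) \Pr(y|\mathbf{x}) \sum_{y'=1}^{M} \left( \Pr(y'|\mathbf{x}) - \delta_{y,y'} \right) \mathbf{x}'(y') \cdot \left( \frac{\mathbf{x}'(y')}{2} - n\,\mathbf{x} + (n-1) \sum_{y''=1}^{M} \Pr(y''|\mathbf{x})\, \mathbf{x}'(y'') \right) \tag{55}$$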

There are several classes of solution to this stationarity condition, corresponding
to one (or more) of the three factors in equation 55 being zero.

1. $\Pr(\mathbf{x}) = 0$ (the first factor is zero). If the input PDF is zero at $\mathbf{x}$, then
nothing can be deduced about $\Pr(y|\mathbf{x})$, because there is no training data
to explore the network's behaviour at this point.

2. $\Pr(y|\mathbf{x}) = 0$ (the second factor is zero). This factor arises from the
differentiation with respect to $\log \Pr(y|\mathbf{x})$, and it ensures that $\Pr(y|\mathbf{x}) < 0$
cannot be attained. The singularity in $\log \Pr(y|\mathbf{x})$ when $\Pr(y|\mathbf{x}) = 0$ is
what causes this solution to emerge.

3. $\sum_{y'=1}^{M} \left( \Pr(y'|\mathbf{x}) - \delta_{y,y'} \right) \mathbf{x}'(y') \cdot (\cdots) = 0$ (the third factor is zero). The
solution to this equation is a $\Pr(y|\mathbf{x})$ that has a piecewise linear dependence
on $\mathbf{x}$. This result can be seen to be intuitively reasonable because
$D_1 + D_2$ is of the form $\int d\mathbf{x}\, \Pr(\mathbf{x})\, f(\mathbf{x})$, where $f(\mathbf{x})$ is a linear combination
of terms of the form $\mathbf{x}^i \Pr(y|\mathbf{x})^j$ (for $i = 0, 1, 2$ and $j = 0, 1, 2$), which is a
quadratic form in $\mathbf{x}$ (ignoring the $\mathbf{x}$-dependence of $\Pr(y|\mathbf{x})$). However, the
terms that appear in this linear combination are such that a $\Pr(y|\mathbf{x})$ that
is a piecewise linear function of $\mathbf{x}$ guarantees that $f(\mathbf{x})$ is a piecewise linear
combination of terms of the form $\mathbf{x}^i$ (for $i = 0, 1, 2$), which is a quadratic
form in $\mathbf{x}$ (the normalisation constraint $\sum_{y=1}^{M} \Pr(y|\mathbf{x}) = 1$ is used to remove
a contribution that is potentially quartic in $\mathbf{x}$). Thus a piecewise
linear dependence of $\Pr(y|\mathbf{x})$ on $\mathbf{x}$ does not lead to any dependencies on
$\mathbf{x}$ that are not already explicitly present in $D_1 + D_2$. The stationarity
condition on $\Pr(y|\mathbf{x})$ (see equation 55) then imposes conditions on the
allowed piecewise linearities that $\Pr(y|\mathbf{x})$ can have.



C Simplified Expressions for $D_1 + D_2$

The expressions for $D_1$ and $D_2$ may be simplified in the case
of joint encoding and factorial encoding. The case of joint encoding is derived
in appendix C.1 and the case of factorial encoding is derived in appendix C.2.
In both cases it is assumed that $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)$ and $\Pr(\mathbf{x}_1, \mathbf{x}_2) = \Pr(\mathbf{x}_1) \Pr(\mathbf{x}_2)$,
where $\Pr(\mathbf{x}_1)$ and $\Pr(\mathbf{x}_2)$ each define a uniform PDF on the input manifold.



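C.1 Joint Encoding

The expressions for $D_1$ and $D_2$ may be simplified in the case of joint encoding,
where $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)$, $y = (y_1, y_2)$ for $1 \le y_1 \le \sqrt{M}$ and $1 \le y_2 \le \sqrt{M}$. In
the following two derivations of the expressions for $D_1$ and $D_2$ the steps in the
derivation use exactly the same sequence of manipulations.
The expression for $D_1$ is

$$D_1 = \frac{2}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \sum_{y_1=1}^{\sqrt{M}} \sum_{y_2=1}^{\sqrt{M}} \Pr(y_1, y_2|\mathbf{x}_1, \mathbf{x}_2) \left\| \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} - \begin{pmatrix} \mathbf{x}'_1(y_1, y_2) \\ \mathbf{x}'_2(y_1, y_2) \end{pmatrix} \right\|^2 \tag{56}$$

The assumed properties of $\Pr(\mathbf{x}_1, \mathbf{x}_2)$ imply that $\mathbf{x}'_1(y_1, y_2) = \mathbf{x}'_1(y_1)$ and
$\mathbf{x}'_2(y_1, y_2) = \mathbf{x}'_2(y_2)$, which gives

$$D_1 = \frac{2}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \sum_{y_1=1}^{\sqrt{M}} \sum_{y_2=1}^{\sqrt{M}} \Pr(y_1, y_2|\mathbf{x}_1, \mathbf{x}_2) \left( \left\| \mathbf{x}_1 - \mathbf{x}'_1(y_1) \right\|^2 + \left\| \mathbf{x}_2 - \mathbf{x}'_2(y_2) \right\|^2 \right) \tag{57}$$

Marginalise $\Pr(y_1, y_2|\mathbf{x}_1, \mathbf{x}_2)$ where possible, using that
$\sum_{y_1=1}^{\sqrt{M}} \Pr(y_1, y_2|\mathbf{x}_1, \mathbf{x}_2) = \Pr(y_2|\mathbf{x}_1, \mathbf{x}_2) = \Pr(y_2|\mathbf{x}_2)$ and
$\sum_{y_2=1}^{\sqrt{M}} \Pr(y_1, y_2|\mathbf{x}_1, \mathbf{x}_2) = \Pr(y_1|\mathbf{x}_1, \mathbf{x}_2) = \Pr(y_1|\mathbf{x}_1)$, to obtain

$$D_1 = \frac{2}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \left( \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|\mathbf{x}_1) \left\| \mathbf{x}_1 - \mathbf{x}'_1(y_1) \right\|^2 + \sum_{y_2=1}^{\sqrt{M}} \Pr(y_2|\mathbf{x}_2) \left\| \mathbf{x}_2 - \mathbf{x}'_2(y_2) \right\|^2 \right) \tag{58}$$

Marginalise $\Pr(\mathbf{x}_1, \mathbf{x}_2)$ where possible, using that $\int d\mathbf{x}_1 \Pr(\mathbf{x}_1, \mathbf{x}_2) = \Pr(\mathbf{x}_2)$
and $\int d\mathbf{x}_2 \Pr(\mathbf{x}_1, \mathbf{x}_2) = \Pr(\mathbf{x}_1)$, to obtain

$$D_1 = \frac{2}{n} \int d\mathbf{x}_1 \Pr(\mathbf{x}_1) \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|\mathbf{x}_1) \left\| \mathbf{x}_1 - \mathbf{x}'_1(y_1) \right\|^2 + \frac{2}{n} \int d\mathbf{x}_2 \Pr(\mathbf{x}_2) \sum_{y_2=1}^{\sqrt{M}} \Pr(y_2|\mathbf{x}_2) \left\| \mathbf{x}_2 - \mathbf{x}'_2(y_2) \right\|^2 \tag{59}$$

Because of the assumed symmetry of the solution, these two terms are the same,
which gives

$$D_1 = \frac{4}{n} \int d\mathbf{x}_1 \Pr(\mathbf{x}_1) \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|\mathbf{x}_1) \left\| \mathbf{x}_1 - \mathbf{x}'_1(y_1) \right\|^2 \tag{60}$$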



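The expression for $D_2$ is

$$D_2 = \frac{2(n-1)}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \left\| \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} - \sum_{y_1=1}^{\sqrt{M}} \sum_{y_2=1}^{\sqrt{M}} \Pr(y_1, y_2|\mathbf{x}_1, \mathbf{x}_2) \begin{pmatrix} \mathbf{x}'_1(y_1, y_2) \\ \mathbf{x}'_2(y_1, y_2) \end{pmatrix} \right\|^2 \tag{61}$$

Use that $\mathbf{x}'_1(y_1, y_2) = \mathbf{x}'_1(y_1)$ and $\mathbf{x}'_2(y_1, y_2) = \mathbf{x}'_2(y_2)$.

$$D_2 = \frac{2(n-1)}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \left( \left\| \mathbf{x}_1 - \sum_{y_1=1}^{\sqrt{M}} \sum_{y_2=1}^{\sqrt{M}} \Pr(y_1, y_2|\mathbf{x}_1, \mathbf{x}_2)\, \mathbf{x}'_1(y_1) \right\|^2 + \left\| \mathbf{x}_2 - \sum_{y_1=1}^{\sqrt{M}} \sum_{y_2=1}^{\sqrt{M}} \Pr(y_1, y_2|\mathbf{x}_1, \mathbf{x}_2)\, \mathbf{x}'_2(y_2) \right\|^2 \right) \tag{62}$$

Use that $\sum_{y_1=1}^{\sqrt{M}} \Pr(y_1, y_2|\mathbf{x}_1, \mathbf{x}_2) = \Pr(y_2|\mathbf{x}_2)$ and $\sum_{y_2=1}^{\sqrt{M}} \Pr(y_1, y_2|\mathbf{x}_1, \mathbf{x}_2) = \Pr(y_1|\mathbf{x}_1)$.

$$D_2 = \frac{2(n-1)}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \left( \left\| \mathbf{x}_1 - \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|\mathbf{x}_1)\, \mathbf{x}'_1(y_1) \right\|^2 + \left\| \mathbf{x}_2 - \sum_{y_2=1}^{\sqrt{M}} \Pr(y_2|\mathbf{x}_2)\, \mathbf{x}'_2(y_2) \right\|^2 \right) \tag{63}$$

Marginalise $\Pr(\mathbf{x}_1, \mathbf{x}_2)$.

$$D_2 = \frac{2(n-1)}{n} \int d\mathbf{x}_1 \Pr(\mathbf{x}_1) \left\| \mathbf{x}_1 - \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|\mathbf{x}_1)\, \mathbf{x}'_1(y_1) \right\|^2 + \frac{2(n-1)}{n} \int d\mathbf{x}_2 \Pr(\mathbf{x}_2) \left\| \mathbf{x}_2 - \sum_{y_2=1}^{\sqrt{M}} \Pr(y_2|\mathbf{x}_2)\, \mathbf{x}'_2(y_2) \right\|^2 \tag{64}$$

Use symmetry.

$$D_2 = \frac{4(n-1)}{n} \int d\mathbf{x}_1 \Pr(\mathbf{x}_1) \left\| \mathbf{x}_1 - \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|\mathbf{x}_1)\, \mathbf{x}'_1(y_1) \right\|^2 \tag{65}$$

These results may be combined to yield finally

$$D_1 + D_2 = \frac{4}{n} \int d\mathbf{x}_1 \Pr(\mathbf{x}_1) \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|\mathbf{x}_1) \left\| \mathbf{x}_1 - \mathbf{x}'_1(y_1) \right\|^2 + \frac{4(n-1)}{n} \int d\mathbf{x}_1 \Pr(\mathbf{x}_1) \left\| \mathbf{x}_1 - \sum_{y_1=1}^{\sqrt{M}} \Pr(y_1|\mathbf{x}_1)\, \mathbf{x}'_1(y_1) \right\|^2 \tag{66}$$

which has the same form as $D_1 + D_2$ would have had for $\mathbf{x}_1$-space alone, with
the replacement $M \to \sqrt{M}$ followed by multiplication by a factor 2 overall. This
implies that the problem of optimising a joint encoder is trivially related to the
problem of optimising an encoder in the $\mathbf{x}_1$-space alone.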
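The marginalisation steps used above can be checked by brute force on a small discrete example. The sketch below makes several simplifying assumptions that are not in the original: each subspace is one-dimensional (scalars rather than points on a circle), the inputs are equally weighted samples on a product grid, and the joint posterior is taken to factorise as $\Pr(y_1|\mathbf{x}_1) \Pr(y_2|\mathbf{x}_2)$, which is one way of satisfying the marginalisation identities used in the derivation. All function and variable names are illustrative.

```python
from itertools import product


def sq(v):
    return v * v


def joint_d1_d2(x1s, x2s, p1, p2, r1, r2, n):
    """Brute-force D1 and D2 over all (x1, x2) pairs and all (y1, y2) pairs."""
    d1 = d2 = 0.0
    for (i, x1), (j, x2) in product(enumerate(x1s), enumerate(x2s)):
        rec1 = rec2 = 0.0
        for (y1, q1), (y2, q2) in product(enumerate(p1[i]), enumerate(p2[j])):
            joint = q1 * q2  # assumed factorised Pr(y1, y2 | x1, x2)
            d1 += joint * (sq(x1 - r1[y1]) + sq(x2 - r2[y2]))
            rec1 += joint * r1[y1]
            rec2 += joint * r2[y2]
        d2 += sq(x1 - rec1) + sq(x2 - rec2)
    count = len(x1s) * len(x2s)
    return 2.0 / n * d1 / count, 2.0 * (n - 1) / n * d2 / count


def marginal_d1_d2(xs, p, r, n):
    """Contribution of a single subspace, as in the separate terms of (59) and (64)."""
    d1 = sum(q * sq(x - ry) for x, row in zip(xs, p) for q, ry in zip(row, r))
    d2 = sum(sq(x - sum(q * ry for q, ry in zip(row, r))) for x, row in zip(xs, p))
    return 2.0 / n * d1 / len(xs), 2.0 * (n - 1) / n * d2 / len(xs)
```

The joint values should equal the sum of the two single-subspace values; when the two subspaces are statistically identical this sum is twice one of them, which is the factor 2 noted in the text.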

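C.2 Factorial Encoding

The expressions for $D_1$ and $D_2$ may be simplified in the case of factorial encoding.
In the following two derivations of the expressions for $D_1$ and $D_2$, the steps
in the derivation use exactly the same sequence of manipulations, except that
$D_2$ has one additional step which separates the contributions inside $\| \cdots \|^2$.
The expression for $D_1$ is

$$D_1 = \frac{2}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \sum_{y=1}^{M} \Pr(y|\mathbf{x}_1, \mathbf{x}_2) \left\| \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} - \begin{pmatrix} \mathbf{x}'_1(y) \\ \mathbf{x}'_2(y) \end{pmatrix} \right\|^2 \tag{67}$$

Split up $\Pr(y|\mathbf{x}_1, \mathbf{x}_2)$, using that $\Pr(y|\mathbf{x}_1, \mathbf{x}_2) = \frac{1}{2} \Pr(y|\mathbf{x}_1) + \frac{1}{2} \Pr(y|\mathbf{x}_2)$, which
gives

$$D_1 = \frac{1}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \sum_{y=1}^{M} \left( \Pr(y|\mathbf{x}_1) + \Pr(y|\mathbf{x}_2) \right) \left\| \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} - \begin{pmatrix} \mathbf{x}'_1(y) \\ \mathbf{x}'_2(y) \end{pmatrix} \right\|^2 \tag{68}$$

Assume that the input manifold is such that $\mathbf{x}'_1(y) = 0$ for $\frac{M}{2} + 1 \le y \le M$,
and $\mathbf{x}'_2(y) = 0$ for $1 \le y \le \frac{M}{2}$. Also use that $\Pr(y|\mathbf{x}_1) = 0$ for $\frac{M}{2} + 1 \le y \le M$,
and $\Pr(y|\mathbf{x}_2) = 0$ for $1 \le y \le \frac{M}{2}$, to obtain

$$D_1 = \frac{1}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \sum_{y=1}^{M/2} \Pr(y|\mathbf{x}_1) \left\| \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} - \begin{pmatrix} \mathbf{x}'_1(y) \\ 0 \end{pmatrix} \right\|^2 + \frac{1}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \sum_{y=M/2+1}^{M} \Pr(y|\mathbf{x}_2) \left\| \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} - \begin{pmatrix} 0 \\ \mathbf{x}'_2(y) \end{pmatrix} \right\|^2 \tag{69}$$

Because of the assumed symmetry of the solution, these two terms are the same,
which gives

$$D_1 = \frac{2}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \sum_{y=1}^{M/2} \Pr(y|\mathbf{x}_1) \left( \left\| \mathbf{x}_1 - \mathbf{x}'_1(y) \right\|^2 + \left\| \mathbf{x}_2 \right\|^2 \right) \tag{70}$$

Marginalise $\Pr(\mathbf{x}_1, \mathbf{x}_2)$ where possible, using that $\int d\mathbf{x}_1 \Pr(\mathbf{x}_1, \mathbf{x}_2) = \Pr(\mathbf{x}_2)$
and $\int d\mathbf{x}_2 \Pr(\mathbf{x}_1, \mathbf{x}_2) = \Pr(\mathbf{x}_1)$, to obtain

$$D_1 = \frac{2}{n} \left( \int d\mathbf{x}_1 \Pr(\mathbf{x}_1) \sum_{y=1}^{M/2} \Pr(y|\mathbf{x}_1) \left\| \mathbf{x}_1 - \mathbf{x}'_1(y) \right\|^2 + \int d\mathbf{x}_2 \Pr(\mathbf{x}_2) \left\| \mathbf{x}_2 \right\|^2 \right) \tag{71}$$

The expression for $D_2$ is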



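$$D_2 = \frac{2(n-1)}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \left\| \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} - \sum_{y=1}^{M} \Pr(y|\mathbf{x}_1, \mathbf{x}_2) \begin{pmatrix} \mathbf{x}'_1(y) \\ \mathbf{x}'_2(y) \end{pmatrix} \right\|^2 \tag{72}$$

Use that $\Pr(y|\mathbf{x}_1, \mathbf{x}_2) = \frac{1}{2} \Pr(y|\mathbf{x}_1) + \frac{1}{2} \Pr(y|\mathbf{x}_2)$.

$$D_2 = \frac{2(n-1)}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \left\| \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \end{pmatrix} - \frac{1}{2} \sum_{y=1}^{M} \left( \Pr(y|\mathbf{x}_1) + \Pr(y|\mathbf{x}_2) \right) \begin{pmatrix} \mathbf{x}'_1(y) \\ \mathbf{x}'_2(y) \end{pmatrix} \right\|^2 \tag{73}$$

Separate the contributions from the upper and lower components inside $\| \cdots \|^2$
to obtain

$$D_2 = \frac{2(n-1)}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \left( \left\| \mathbf{x}_1 - \frac{1}{2} \sum_{y=1}^{M} \left( \Pr(y|\mathbf{x}_1) + \Pr(y|\mathbf{x}_2) \right) \mathbf{x}'_1(y) \right\|^2 + \left\| \mathbf{x}_2 - \frac{1}{2} \sum_{y=1}^{M} \left( \Pr(y|\mathbf{x}_1) + \Pr(y|\mathbf{x}_2) \right) \mathbf{x}'_2(y) \right\|^2 \right) \tag{74}$$

Use that $\mathbf{x}'_1(y) = 0$ for $\frac{M}{2} + 1 \le y \le M$, and $\mathbf{x}'_2(y) = 0$ for $1 \le y \le \frac{M}{2}$. Also
use that $\Pr(y|\mathbf{x}_1) = 0$ for $\frac{M}{2} + 1 \le y \le M$, and $\Pr(y|\mathbf{x}_2) = 0$ for $1 \le y \le \frac{M}{2}$.

$$D_2 = \frac{2(n-1)}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \left\| \mathbf{x}_1 - \frac{1}{2} \sum_{y=1}^{M/2} \Pr(y|\mathbf{x}_1)\, \mathbf{x}'_1(y) \right\|^2 + \frac{2(n-1)}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \left\| \mathbf{x}_2 - \frac{1}{2} \sum_{y=M/2+1}^{M} \Pr(y|\mathbf{x}_2)\, \mathbf{x}'_2(y) \right\|^2 \tag{75}$$

Use symmetry.

$$D_2 = \frac{4(n-1)}{n} \int d\mathbf{x}_1 d\mathbf{x}_2\, \Pr(\mathbf{x}_1, \mathbf{x}_2) \left\| \mathbf{x}_1 - \frac{1}{2} \sum_{y=1}^{M/2} \Pr(y|\mathbf{x}_1)\, \mathbf{x}'_1(y) \right\|^2 \tag{76}$$

Marginalise $\Pr(\mathbf{x}_1, \mathbf{x}_2)$.

$$D_2 = \frac{4(n-1)}{n} \int d\mathbf{x}_1 \Pr(\mathbf{x}_1) \left\| \mathbf{x}_1 - \frac{1}{2} \sum_{y=1}^{M/2} \Pr(y|\mathbf{x}_1)\, \mathbf{x}'_1(y) \right\|^2 \tag{77}$$

These results may be combined to yield finally

$$D_1 + D_2 = \frac{2}{n} \int d\mathbf{x}_2 \Pr(\mathbf{x}_2) \left\| \mathbf{x}_2 \right\|^2 + \frac{2}{n} \int d\mathbf{x}_1 \Pr(\mathbf{x}_1) \sum_{y=1}^{M/2} \Pr(y|\mathbf{x}_1) \left\| \mathbf{x}_1 - \mathbf{x}'_1(y) \right\|^2 + \frac{4(n-1)}{n} \int d\mathbf{x}_1 \Pr(\mathbf{x}_1) \left\| \mathbf{x}_1 - \frac{1}{2} \sum_{y=1}^{M/2} \Pr(y|\mathbf{x}_1)\, \mathbf{x}'_1(y) \right\|^2 \tag{78}$$

The stationarity conditions may be derived from this expression for the factorial
encoding version of $D_1 + D_2$. The stationarity condition w.r.t. $\Pr(y|\mathbf{x}_1)$
is

$$\sum_{y'=1}^{M/2} \left( \Pr(y'|\mathbf{x}_1) - \delta_{y,y'} \right) \mathbf{x}'_1(y') \cdot \left( \frac{\mathbf{x}'_1(y')}{2} - n\,\mathbf{x}_1 + \frac{n-1}{2} \sum_{y''=1}^{M/2} \Pr(y''|\mathbf{x}_1)\, \mathbf{x}'_1(y'') \right) = 0 \tag{79}$$

and the stationarity condition w.r.t. $\mathbf{x}'_1(y)$ is

$$n \int d\mathbf{x}_1 \Pr(\mathbf{x}_1|y)\, \mathbf{x}_1 = \mathbf{x}'_1(y) + \frac{n-1}{2} \int d\mathbf{x}_1 \Pr(\mathbf{x}_1|y) \sum_{y'=1}^{M/2} \Pr(y'|\mathbf{x}_1)\, \mathbf{x}'_1(y') \tag{80}$$

Both of these stationarity conditions can be obtained from the standard
ones by making the replacements $(n-1) \sum_{y'=1}^{M} \Pr(y'|\mathbf{x}_1)\, \mathbf{x}'_1(y') \to \frac{n-1}{2} \sum_{y'=1}^{M/2} \Pr(y'|\mathbf{x}_1)\, \mathbf{x}'_1(y')$ and $M \to \frac{M}{2}$.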

D Minimise $D_1 + D_2$

The expression for $D_1 + D_2$ needs to be minimised with respect to the reference
vectors $\mathbf{x}'(y)$ and the posterior probabilities $\Pr(y|\mathbf{x})$. There are four cases to
consider, which are various combinations of circular/toroidal input manifold
(appendices D.1 and D.2 / appendices D.3 and D.4) and two/three overlapping
posterior probabilities (appendices D.1 and D.3 / appendices D.2 and D.4). For
a toroidal manifold it is not necessary to consider the case of joint encoding,
because it is directly related to encoding a circular manifold, which is dealt with
in appendices D.1 and D.2.



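D.1 Circular Manifold: 2 Overlapping Posterior Probabilities

For $0 \le s \le \frac{\pi}{M}$ the functional form of $p(\theta)$ that ensures a piecewise linear
$\Pr(y|\mathbf{x})$ is

$$p(\theta) = \begin{cases} 1 & 0 \le |\theta| \le \frac{\pi}{M} - s \\ f(\theta) & \frac{\pi}{M} - s \le |\theta| \le \frac{\pi}{M} + s \\ 0 & |\theta| \ge \frac{\pi}{M} + s \end{cases} \tag{81}$$

where $f(\theta) = a + b \cos\theta + c \sin|\theta|$. Continuity of $p(\theta)$ gives $f\left(\frac{\pi}{M} - s\right) = 1$
and $f\left(\frac{\pi}{M} + s\right) = 0$. Normalisation of $p(\theta)$ in the interval $\frac{\pi}{M} - s \le \theta \le \frac{\pi}{M} + s$
requires that $f(\theta) + f\left(\frac{2\pi}{M} - \theta\right) = 1$. These yield $f(\theta)$ in the form

$$f(\theta) = \frac{1}{2} + \frac{1}{2} \frac{\sin\left(\frac{\pi}{M} - \theta\right)}{\sin s} \tag{82}$$

$D_1 + D_2$ must be stationary w.r.t. variation of $p(\theta)$ in the interval $\frac{\pi}{M} - s \le \theta \le
\frac{\pi}{M} + s$, which yields the condition

$$0 = r \csc^2 s\, \sin\left(\frac{\pi}{M}\right) \sin\left(\frac{\pi}{M} - \theta\right) \left( \sin s - \sin\left(\frac{\pi}{M} - \theta\right) \right) \left( n \sin s - (n-1)\, r \sin\left(\frac{\pi}{M}\right) \right) \tag{83}$$

which gives the optimum solution for $r$ as

$$r = \frac{n}{n-1} \frac{\sin s}{\sin\left(\frac{\pi}{M}\right)} \tag{84}$$

$D_1 + D_2$ must be stationary w.r.t. variation of $r$. This yields a transcendental
equation that must be satisfied by the optimum solution for $s$ as

$$\frac{\sin s}{\sin\left(\frac{\pi}{M}\right)} - \frac{n-1}{n} \frac{M}{\pi} \sin\left(\frac{\pi}{M}\right) \left( \cos s + s \sin s \right) = 0 \tag{85}$$

$D_1$ and $D_2$ may be written out in full as (using $\mathbf{n}(\theta) = (\cos\theta, \sin\theta)$)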



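$$D_1 = \frac{2M}{n\pi} \left( \int_0^{\frac{\pi}{M} - s} d\theta\, \left\| \mathbf{n}(\theta) - r\,\mathbf{n}(0) \right\|^2 + \int_{\frac{\pi}{M} - s}^{\frac{\pi}{M}} d\theta\, f(\theta) \left\| \mathbf{n}(\theta) - r\,\mathbf{n}(0) \right\|^2 + \int_{\frac{\pi}{M} - s}^{\frac{\pi}{M}} d\theta\, f\left(\tfrac{2\pi}{M} - \theta\right) \left\| \mathbf{n}(\theta) - r\,\mathbf{n}\left(\tfrac{2\pi}{M}\right) \right\|^2 \right) \tag{86}$$

$$D_2 = \frac{2(n-1)M}{n\pi} \left( \int_0^{\frac{\pi}{M} - s} d\theta\, \left\| \mathbf{n}(\theta) - r\,\mathbf{n}(0) \right\|^2 + \int_{\frac{\pi}{M} - s}^{\frac{\pi}{M}} d\theta\, \left\| \mathbf{n}(\theta) - r f(\theta)\, \mathbf{n}(0) - r f\left(\tfrac{2\pi}{M} - \theta\right) \mathbf{n}\left(\tfrac{2\pi}{M}\right) \right\|^2 \right) \tag{87}$$

The optimum $f(\theta)$ and $r$ may be substituted into $D_1 + D_2$, the integrations
evaluated, and then the condition that the optimum $s$ must satisfy may be used
to simplify the result, to yield the minimum $D_1 + D_2$ as

$$D_1 + D_2 = 2 - \frac{n}{n-1} \frac{M}{2\pi} \left( 2s + \sin(2s) \right) \tag{88}$$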
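The chain of results in this appendix (equations 82, 84, 85 and 88) can be cross-checked numerically against the integrals in equations 86 and 87. The following is a rough sketch rather than part of the original analysis: it assumes the forms of these equations as written above, uses bisection for the transcendental equation and Simpson's rule for the integrals, and all function names are illustrative. It only applies when a root $s$ exists in the range $0 \le s \le \frac{\pi}{M}$ (for instance $n = 2$, $M = 4$).

```python
import math


def simpson(func, a, b, steps=2000):
    """Composite Simpson rule; steps must be even."""
    h = (b - a) / steps
    total = func(a) + func(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3


def solve_s(n, big_m, iters=200):
    """Bisection on equation 85 for s in (0, pi/M)."""
    alpha = math.pi / big_m

    def g(s):
        return (math.sin(s) / math.sin(alpha)
                - (n - 1) / n / alpha * math.sin(alpha)
                * (math.cos(s) + s * math.sin(s)))

    lo, hi = 1e-12, alpha
    if g(lo) * g(hi) > 0:
        raise ValueError("no root in the 2-overlap range")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def optimum_r(n, big_m, s):
    """Equation 84."""
    return n / (n - 1) * math.sin(s) / math.sin(math.pi / big_m)


def dist2(theta, terms):
    """|| n(theta) - sum_k c_k n(phi_k) ||^2 for terms = [(c_k, phi_k), ...]."""
    ex = math.cos(theta) - sum(c * math.cos(ph) for c, ph in terms)
    ey = math.sin(theta) - sum(c * math.sin(ph) for c, ph in terms)
    return ex * ex + ey * ey


def direct_cost(n, big_m, s, r):
    """D1 + D2 from the integrals in equations 86 and 87."""
    alpha = math.pi / big_m
    nb = 2 * alpha  # angular position of the neighbouring neuron

    def f(t):  # equation 82
        return 0.5 + 0.5 * math.sin(alpha - t) / math.sin(s)

    single = simpson(lambda t: dist2(t, [(r, 0.0)]), 0.0, alpha - s)
    d1 = single + simpson(
        lambda t: f(t) * dist2(t, [(r, 0.0)]) + f(nb - t) * dist2(t, [(r, nb)]),
        alpha - s, alpha)
    d2 = single + simpson(
        lambda t: dist2(t, [(r * f(t), 0.0), (r * f(nb - t), nb)]),
        alpha - s, alpha)
    pre = 2 * big_m / (n * math.pi)
    return pre * d1 + pre * (n - 1) * d2


def closed_form_cost(n, big_m, s):
    """Equation 88."""
    return 2 - n / (n - 1) * big_m / (2 * math.pi) * (2 * s + math.sin(2 * s))
```

A typical use is to call `solve_s`, pass the result to `optimum_r`, and compare `direct_cost` with `closed_form_cost`; the two should agree to within the quadrature error.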



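D.2 Circular Manifold: 3 Overlapping Posterior Probabilities

For $\frac{\pi}{M} \le s \le \frac{2\pi}{M}$ the functional form of $p(\theta)$ that ensures a piecewise linear
$\Pr(y|\mathbf{x})$ is

$$p(\theta) = \begin{cases} f_1(\theta) & 0 \le |\theta| \le -\frac{\pi}{M} + s \\ f_2(\theta) & -\frac{\pi}{M} + s \le |\theta| \le \frac{3\pi}{M} - s \\ f_3(\theta) & \frac{3\pi}{M} - s \le |\theta| \le \frac{\pi}{M} + s \\ 0 & |\theta| \ge \frac{\pi}{M} + s \end{cases} \tag{89}$$

where $f_i(\theta) = a_i + b_i \cos\theta + c_i \sin|\theta|$ for $i = 1, 2, 3$. Continuity of $p(\theta)$ gives
$f_1\left(-\frac{\pi}{M} + s\right) = f_2\left(-\frac{\pi}{M} + s\right)$, $f_2\left(\frac{3\pi}{M} - s\right) = f_3\left(\frac{3\pi}{M} - s\right)$ and $f_3\left(\frac{\pi}{M} + s\right) = 0$.
Normalisation of $p(\theta)$ in the interval $0 \le \theta \le -\frac{\pi}{M} + s$ requires that $f_1(\theta) +
f_3\left(\frac{2\pi}{M} - \theta\right) + f_3\left(\frac{2\pi}{M} + \theta\right) = 1$, and normalisation of $p(\theta)$ in the interval $-\frac{\pi}{M} + s \le
\theta \le \frac{3\pi}{M} - s$ requires that $f_2(\theta) + f_2\left(\frac{2\pi}{M} - \theta\right) = 1$. These conditions may be used
to eliminate all but a pair of parameters in the $f_i(\theta)$, which may thus be written
in the form

$$\begin{aligned}
f_1(\theta) &= \frac{1}{2} \cos\theta\, \sec\left(\tfrac{\pi}{M} - s\right) + a_1 \left( 1 - \cos\theta\, \sec\left(\tfrac{\pi}{M} - s\right) \right) + b_2 \cos\theta\, \csc\left(\tfrac{\pi}{M}\right) \sin\left(\tfrac{2\pi}{M} - s\right) \sec\left(\tfrac{\pi}{M} - s\right) \\
f_2(\theta) &= \frac{1}{2} + b_2 \left( \cos\theta - \cot\left(\tfrac{\pi}{M}\right) \sin\theta \right) \\
f_3(\theta) &= \frac{1}{2} \left( 1 - \csc\left(\tfrac{2\pi}{M} - 2s\right) \sin\left(\tfrac{3\pi}{M} - s - \theta\right) \right) + \frac{a_1}{2} \left( \cos\left(\tfrac{2\pi}{M} - \theta\right) \sec\left(\tfrac{\pi}{M} - s\right) - 1 \right) \\
&\quad + b_2 \csc\left(\tfrac{\pi}{M}\right) \csc\left(\tfrac{2\pi}{M} - 2s\right) \sin\left(\tfrac{2\pi}{M} - s\right) \sin\left(\tfrac{\pi}{M} + s - \theta\right)
\end{aligned} \tag{90}$$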

$D_1 + D_2$ must be stationary w.r.t. variation of $p(\theta)$ in each of the 3 intervals
$0 \le \theta \le -\frac{\pi}{M} + s$ (interval 1), $-\frac{\pi}{M} + s \le \theta \le \frac{3\pi}{M} - s$ (interval 2), and $\frac{3\pi}{M} - s \le \theta \le
\frac{\pi}{M} + s$ (interval 3). The Fourier transform w.r.t. $\theta$ of each of these 3 stationarity
conditions has 5 terms with basis functions $(1, \cos\theta, \sin\theta, \cos 2\theta, \sin 2\theta)$, and
each of the total of 15 Fourier coefficients must be zero. There are only 3 free
parameters $a_1$, $b_2$ and $r$, so only 3 of the 15 are actually independent; the
particular 3 that are used are selected on the basis of ease of solution for the
free parameters $a_1$, $b_2$ and $r$. The coefficient of the $\cos 2\theta$ term in interval 2
yields

$$b_2\, r \left( n + 2 b_2 r - 2 b_2 r n \right) \cos\left(\frac{2\pi}{M}\right) = 0 \tag{91}$$

which has the solution

$$b_2 = \frac{n}{2 (n-1)\, r} \tag{92}$$

which may be substituted back into the coefficient of the $\cos\theta$ term in interval
1 to yield



=rsec ( V7 — s] sin ( — ) 



(n-1) (-6a? + 7ai-2) rsin(^) 
+ (n-l) (2 a 2_3 ai + i) rsin(ff) ] (93) 
+n (01 sin - «) + (1 - ai) sin (f - *)) 
and also substituted back into the coefficient of the $\sin\theta$ term in interval 3 to
yield



= r cos (£) esc - S ) sec - a) sin 2 



V 



(n — 1) r 



+n 



m) 

( -2ai(3oi-2)cos(^-a) 
-2 (ai-1) aiCos(^ + s) 
+ (l-2 ai +2a 2 ) cos (ff - s 
y + (l - 4ai + 6 a 2 ) cos (s) 
( 0l + l)cos(ft) - (ai-l)cos (^) 
-2aisin(^)sin(^-2s) 



\ \ 

/ 
J 



(94) 



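These two conditions may be solved for $a_1$ and $r$ to yield

$$a_1 = \frac{\cos\left(\frac{2\pi}{M}\right)}{\cos\left(\frac{2\pi}{M}\right) - 1} \tag{95}$$

and

$$r = \frac{n}{n-1} \cos\left(\frac{2\pi}{M} - s\right) \sec\left(\frac{\pi}{M}\right) \tag{96}$$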



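The solutions for $a_1$ and $b_2$ may be substituted back into the expressions for the
$f_i(\theta)$ to reduce them to the form

$$\begin{aligned}
f_1(\theta) &= -\frac{1}{4} \left( \cos\left(\tfrac{4\pi}{M} - s\right) + \cos s - 2 \cos\left(\tfrac{\pi}{M}\right) \cos\theta \right) \csc^2\left(\tfrac{\pi}{M}\right) \sec\left(\tfrac{2\pi}{M} - s\right) \\
f_2(\theta) &= \frac{1}{2} + \frac{1}{2} \cot\left(\tfrac{\pi}{M}\right) \sec\left(\tfrac{2\pi}{M} - s\right) \sin\left(\tfrac{\pi}{M} - \theta\right) \\
f_3(\theta) &= \frac{1}{4} \csc^2\left(\tfrac{\pi}{M}\right) \left( 1 - \cos\left(\tfrac{3\pi}{M} - \theta\right) \sec\left(\tfrac{2\pi}{M} - s\right) \right)
\end{aligned} \tag{97}$$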



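$D_1 + D_2$ must be stationary w.r.t. variation of $r$. This yields a transcendental
equation that must be satisfied by the optimum solution for $s$ as

$$\frac{1}{n} \frac{\cos\left(\frac{2\pi}{M} - s\right)}{\cos\left(\frac{\pi}{M}\right)} - \frac{n-1}{n} \frac{M}{\pi} \cos\left(\frac{\pi}{M}\right) \left( \sin\left(\frac{2\pi}{M} - s\right) - \left(\frac{2\pi}{M} - s\right) \cos\left(\frac{2\pi}{M} - s\right) \right) = 0 \tag{98}$$

$D_1$ and $D_2$ may be written out in full as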



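$$\begin{aligned}
D_1 = \frac{M}{n\pi} \Bigg( &\int_0^{-\frac{\pi}{M} + s} d\theta \left( f_1(\theta) \left\| \mathbf{n}(\theta) - r\,\mathbf{n}(0) \right\|^2 + f_3\left(\tfrac{2\pi}{M} - \theta\right) \left\| \mathbf{n}(\theta) - r\,\mathbf{n}\left(\tfrac{2\pi}{M}\right) \right\|^2 + f_3\left(\tfrac{2\pi}{M} + \theta\right) \left\| \mathbf{n}(\theta) - r\,\mathbf{n}\left(-\tfrac{2\pi}{M}\right) \right\|^2 \right) \\
+ &\int_{-\frac{\pi}{M} + s}^{\frac{3\pi}{M} - s} d\theta \left( f_2(\theta) \left\| \mathbf{n}(\theta) - r\,\mathbf{n}(0) \right\|^2 + f_2\left(\tfrac{2\pi}{M} - \theta\right) \left\| \mathbf{n}(\theta) - r\,\mathbf{n}\left(\tfrac{2\pi}{M}\right) \right\|^2 \right) \\
+ &\int_{\frac{3\pi}{M} - s}^{\frac{2\pi}{M}} d\theta \left( f_3(\theta) \left\| \mathbf{n}(\theta) - r\,\mathbf{n}(0) \right\|^2 + f_1\left(\tfrac{2\pi}{M} - \theta\right) \left\| \mathbf{n}(\theta) - r\,\mathbf{n}\left(\tfrac{2\pi}{M}\right) \right\|^2 + f_3\left(\tfrac{4\pi}{M} - \theta\right) \left\| \mathbf{n}(\theta) - r\,\mathbf{n}\left(\tfrac{4\pi}{M}\right) \right\|^2 \right) \Bigg)
\end{aligned} \tag{99}$$

$$\begin{aligned}
D_2 = \frac{(n-1)M}{n\pi} \Bigg( &\int_0^{-\frac{\pi}{M} + s} d\theta \left\| \mathbf{n}(\theta) - f_1(\theta)\, r\,\mathbf{n}(0) - f_3\left(\tfrac{2\pi}{M} - \theta\right) r\,\mathbf{n}\left(\tfrac{2\pi}{M}\right) - f_3\left(\tfrac{2\pi}{M} + \theta\right) r\,\mathbf{n}\left(-\tfrac{2\pi}{M}\right) \right\|^2 \\
+ &\int_{-\frac{\pi}{M} + s}^{\frac{3\pi}{M} - s} d\theta \left\| \mathbf{n}(\theta) - f_2(\theta)\, r\,\mathbf{n}(0) - f_2\left(\tfrac{2\pi}{M} - \theta\right) r\,\mathbf{n}\left(\tfrac{2\pi}{M}\right) \right\|^2 \\
+ &\int_{\frac{3\pi}{M} - s}^{\frac{2\pi}{M}} d\theta \left\| \mathbf{n}(\theta) - f_3(\theta)\, r\,\mathbf{n}(0) - f_1\left(\tfrac{2\pi}{M} - \theta\right) r\,\mathbf{n}\left(\tfrac{2\pi}{M}\right) - f_3\left(\tfrac{4\pi}{M} - \theta\right) r\,\mathbf{n}\left(\tfrac{4\pi}{M}\right) \right\|^2 \Bigg)
\end{aligned} \tag{100}$$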

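The optimum $f_i(\theta)$ and $r$ may be substituted into $D_1 + D_2$, the integrations
evaluated, and then the condition that the optimum $s$ must satisfy may be used
to simplify the result, to yield the minimum $D_1 + D_2$ as

$$D_1 + D_2 = \frac{(n-1) \left( 2(n-2) - n \frac{M}{\pi} s \right) - n \sec^2\left(\frac{\pi}{M}\right)}{2 (n-1)^2} - \frac{n \left( (n-1) \left( 2 - \frac{M}{\pi} s \right) + \sec^2\left(\frac{\pi}{M}\right) \right)}{2 (n-1)^2} \cos\left(\frac{4\pi}{M} - 2s\right) \tag{101}$$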



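D.3 Toroidal Manifold: 2 Overlapping Posterior Probabilities

For $0 \le s \le \frac{2\pi}{M}$ the functional form of $p(\theta)$ may be obtained directly from the
circular case with the replacement $M \to \frac{M}{2}$, so that

$$p(\theta) = \begin{cases} 1 & 0 \le |\theta| \le \frac{2\pi}{M} - s \\ f(\theta) & \frac{2\pi}{M} - s \le |\theta| \le \frac{2\pi}{M} + s \\ 0 & |\theta| \ge \frac{2\pi}{M} + s \end{cases} \tag{102}$$

$$f(\theta) = \frac{1}{2} + \frac{1}{2} \frac{\sin\left(\frac{2\pi}{M} - \theta\right)}{\sin s} \tag{103}$$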



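$D_1 + D_2$ must be stationary w.r.t. variation of $p(\theta)$ in the interval $\frac{2\pi}{M} - s \le
\theta \le \frac{2\pi}{M} + s$, which yields the condition

$$0 = r \csc^2 s\, \sin\left(\frac{2\pi}{M}\right) \sin\left(\frac{2\pi}{M} - \theta\right) \left( \sin s - \sin\left(\frac{2\pi}{M} - \theta\right) \right) \left( 2n \sin s - (n-1)\, r \sin\left(\frac{2\pi}{M}\right) \right) \tag{104}$$

which has the same form as the circular case with the replacements $M \to \frac{M}{2}$
and $n \to \frac{2n}{n+1}$, which gives the optimum solution for $r$ as

$$r = \frac{2n}{n-1} \frac{\sin s}{\sin\left(\frac{2\pi}{M}\right)} \tag{105}$$

$D_1 + D_2$ must be stationary w.r.t. variation of $r$. This yields a transcendental
equation that must be satisfied by the optimum solution for $s$ as

$$\frac{\sin s}{\sin\left(\frac{2\pi}{M}\right)} - \frac{n-1}{n+1} \frac{M}{2\pi} \sin\left(\frac{2\pi}{M}\right) \left( \cos s + s \sin s \right) = 0 \tag{106}$$

which has the same form as the circular case with the replacements $M \to \frac{M}{2}$
and $n \to \frac{n+1}{2}$. $D_1$ and $D_2$ may be written out in full as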



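$$D_1 = \frac{M}{n\pi} \left( \int_0^{\frac{2\pi}{M} - s} d\theta \left( 1 + \left\| \mathbf{n}(\theta) - r\,\mathbf{n}(0) \right\|^2 \right) + \int_{\frac{2\pi}{M} - s}^{\frac{2\pi}{M}} d\theta\, f(\theta) \left( 1 + \left\| \mathbf{n}(\theta) - r\,\mathbf{n}(0) \right\|^2 \right) + \int_{\frac{2\pi}{M} - s}^{\frac{2\pi}{M}} d\theta\, f\left(\tfrac{4\pi}{M} - \theta\right) \left( 1 + \left\| \mathbf{n}(\theta) - r\,\mathbf{n}\left(\tfrac{4\pi}{M}\right) \right\|^2 \right) \right) \tag{107}$$

$$D_2 = \frac{2(n-1)M}{n\pi} \left( \int_0^{\frac{2\pi}{M} - s} d\theta \left\| \mathbf{n}(\theta) - \tfrac{1}{2} r\,\mathbf{n}(0) \right\|^2 + \int_{\frac{2\pi}{M} - s}^{\frac{2\pi}{M}} d\theta \left\| \mathbf{n}(\theta) - \tfrac{1}{2} r f(\theta)\, \mathbf{n}(0) - \tfrac{1}{2} r f\left(\tfrac{4\pi}{M} - \theta\right) \mathbf{n}\left(\tfrac{4\pi}{M}\right) \right\|^2 \right) \tag{108}$$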



The optimum $f(\theta)$ and $r$ may be substituted into $D_1 + D_2$, the integrations
evaluated, and then the condition that the optimum $s$ must satisfy may be used
to simplify the result, to yield the minimum $D_1 + D_2$ as

$$D_1 + D_2 = 4 - \frac{n}{n-1} \frac{M}{2\pi} \left( 2s + \sin(2s) \right) \tag{109}$$

which has the same form as the circular case plus an extra contribution of 2.
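The factorial encoder results for two overlapping posterior probabilities can be cross-checked in the same way as the circular case. This is a hedged sketch that assumes the forms of equations 103, 105, 106, 107, 108 and 109 as written above; the quadrature, the bisection search and all names are illustrative choices rather than the method of the original. It applies only when equation 106 has a root in $0 \le s \le \frac{2\pi}{M}$ (for instance $n = 5$, $M = 8$).

```python
import math


def simpson(func, a, b, steps=2000):
    """Composite Simpson rule; steps must be even."""
    h = (b - a) / steps
    total = func(a) + func(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3


def solve_s_torus(n, big_m, iters=200):
    """Bisection on equation 106 for s in (0, 2*pi/M)."""
    alpha = 2 * math.pi / big_m

    def g(s):
        return (math.sin(s) / math.sin(alpha)
                - (n - 1) / (n + 1) / alpha * math.sin(alpha)
                * (math.cos(s) + s * math.sin(s)))

    lo, hi = 1e-12, alpha
    if g(lo) * g(hi) > 0:
        raise ValueError("no root in the 2-overlap range")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def optimum_r_torus(n, big_m, s):
    """Equation 105."""
    return 2 * n / (n - 1) * math.sin(s) / math.sin(2 * math.pi / big_m)


def dist2(theta, terms):
    """|| n(theta) - sum_k c_k n(phi_k) ||^2 for terms = [(c_k, phi_k), ...]."""
    ex = math.cos(theta) - sum(c * math.cos(ph) for c, ph in terms)
    ey = math.sin(theta) - sum(c * math.sin(ph) for c, ph in terms)
    return ex * ex + ey * ey


def direct_cost_torus(n, big_m, s, r):
    """D1 + D2 from the integrals in equations 107 and 108."""
    alpha = 2 * math.pi / big_m
    nb = 2 * alpha  # angular position of the neighbouring neuron

    def f(t):  # equation 103
        return 0.5 + 0.5 * math.sin(alpha - t) / math.sin(s)

    d1 = simpson(lambda t: 1 + dist2(t, [(r, 0.0)]), 0.0, alpha - s)
    d1 += simpson(
        lambda t: f(t) * (1 + dist2(t, [(r, 0.0)]))
        + f(nb - t) * (1 + dist2(t, [(r, nb)])),
        alpha - s, alpha)
    d2 = simpson(lambda t: dist2(t, [(0.5 * r, 0.0)]), 0.0, alpha - s)
    d2 += simpson(
        lambda t: dist2(t, [(0.5 * r * f(t), 0.0), (0.5 * r * f(nb - t), nb)]),
        alpha - s, alpha)
    return big_m / (n * math.pi) * d1 + 2 * (n - 1) * big_m / (n * math.pi) * d2


def closed_form_cost_torus(n, big_m, s):
    """Equation 109."""
    return 4 - n / (n - 1) * big_m / (2 * math.pi) * (2 * s + math.sin(2 * s))
```

As in the circular case, agreement between `direct_cost_torus` and `closed_form_cost_torus` at the solved values of $s$ and $r$ is a consistency check on the reconstructed equations, not an independent derivation.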



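D.4 Toroidal Manifold: 3 Overlapping Posterior Probabilities

For $\frac{2\pi}{M} \le s \le \frac{4\pi}{M}$ the functional form of $p(\theta)$ may be obtained directly from the
circular case with the replacement $M \to \frac{M}{2}$, so that

$$p(\theta) = \begin{cases} f_1(\theta) & 0 \le |\theta| \le -\frac{2\pi}{M} + s \\ f_2(\theta) & -\frac{2\pi}{M} + s \le |\theta| \le \frac{6\pi}{M} - s \\ f_3(\theta) & \frac{6\pi}{M} - s \le |\theta| \le \frac{2\pi}{M} + s \\ 0 & |\theta| \ge \frac{2\pi}{M} + s \end{cases} \tag{110}$$

$$\begin{aligned}
f_1(\theta) &= \frac{1}{2} \cos\theta\, \sec\left(\tfrac{2\pi}{M} - s\right) + a_1 \left( 1 - \cos\theta\, \sec\left(\tfrac{2\pi}{M} - s\right) \right) + b_2 \cos\theta\, \csc\left(\tfrac{2\pi}{M}\right) \sin\left(\tfrac{4\pi}{M} - s\right) \sec\left(\tfrac{2\pi}{M} - s\right) \\
f_2(\theta) &= \frac{1}{2} + b_2 \left( \cos\theta - \cot\left(\tfrac{2\pi}{M}\right) \sin\theta \right) \\
f_3(\theta) &= \frac{1}{2} \left( 1 - \csc\left(\tfrac{4\pi}{M} - 2s\right) \sin\left(\tfrac{6\pi}{M} - s - \theta\right) \right) + \frac{a_1}{2} \left( \cos\left(\tfrac{4\pi}{M} - \theta\right) \sec\left(\tfrac{2\pi}{M} - s\right) - 1 \right) \\
&\quad + b_2 \csc\left(\tfrac{2\pi}{M}\right) \csc\left(\tfrac{4\pi}{M} - 2s\right) \sin\left(\tfrac{4\pi}{M} - s\right) \sin\left(\tfrac{2\pi}{M} + s - \theta\right)
\end{aligned} \tag{111}$$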



$D_1 + D_2$ must be stationary w.r.t. variation of $p(\theta)$ in each of the 3 intervals
$0 \le \theta \le -\frac{2\pi}{M} + s$ (interval 1), $-\frac{2\pi}{M} + s \le \theta \le \frac{6\pi}{M} - s$ (interval 2), and $\frac{6\pi}{M} - s \le
\theta \le \frac{2\pi}{M} + s$ (interval 3). The coefficient of the $\cos 2\theta$ term in interval 2 yields

$$b_2\, r \left( n + b_2 r - b_2 r n \right) \cos\left(\frac{4\pi}{M}\right) = 0 \tag{112}$$

which has the same form as the circular case with the replacements $M \to \frac{M}{2}$
and $n \to \frac{2n}{n+1}$, which has the solution

$$b_2 = \frac{n}{(n-1)\, r} \tag{113}$$

which may be substituted back into the coefficient of the $\cos\theta$ term in interval
1 to yield



=r sec 



2tt \ . /2tt 

s sin — 

M ) \M 



(n - 1) (-6 a\ + 7 on - 2) r sin ( 



2tt * 



+ (n-l) (2a?-3ai + l) rsin(ff) ) (114) 
+2n (ai sin (f - s) + (1 - 01) sin (§f - s)) 
and also substituted back into the coefficient of the $\sin\theta$ term in interval 3 to
yield



,27T \ /2tt 
=r cos — esc — 
Ml \M 



sec 



2tt 
M 



sin 



— (n — 1) r 



V 



+2n 



/ -2ai(3ai-2)cos(^-s 
-2 (oi - 1) oi cos + s 
+ (l-2 ai + 2a2) cosf^f-s 

\ + (l - 4 oi +6 a?) cos(s) 
(ai + l)cos(5)-(oi-l)cos(if) 
-2oisinff)sin(f -2s) 



) 

J 



(115) 



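both of which have the same form as the circular case with the replacements
$M \to \frac{M}{2}$ and $n \to \frac{2n}{n+1}$. These two conditions may be solved for $a_1$ and $r$ to
yield

$$a_1 = \frac{\cos\left(\frac{4\pi}{M}\right)}{\cos\left(\frac{4\pi}{M}\right) - 1} \tag{116}$$

and

$$r = \frac{2n}{n-1} \cos\left(\frac{4\pi}{M} - s\right) \sec\left(\frac{2\pi}{M}\right) \tag{117}$$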



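The solutions for $a_1$ and $b_2$ may be substituted back into the expressions for the
$f_i(\theta)$ to reduce them to the form

$$\begin{aligned}
f_1(\theta) &= -\frac{1}{4} \left( \cos\left(\tfrac{8\pi}{M} - s\right) + \cos s - 2 \cos\left(\tfrac{2\pi}{M}\right) \cos\theta \right) \csc^2\left(\tfrac{2\pi}{M}\right) \sec\left(\tfrac{4\pi}{M} - s\right) \\
f_2(\theta) &= \frac{1}{2} + \frac{1}{2} \cot\left(\tfrac{2\pi}{M}\right) \sec\left(\tfrac{4\pi}{M} - s\right) \sin\left(\tfrac{2\pi}{M} - \theta\right) \\
f_3(\theta) &= \frac{1}{4} \csc^2\left(\tfrac{2\pi}{M}\right) \left( 1 - \cos\left(\tfrac{6\pi}{M} - \theta\right) \sec\left(\tfrac{4\pi}{M} - s\right) \right)
\end{aligned} \tag{118}$$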



which have the same form as the circular case with the replacement $M \to \frac{M}{2}$.
$D_1 + D_2$ must be stationary w.r.t. variation of $r$. This yields a transcendental
equation that must be satisfied by the optimum solution for $s$ as

$$\frac{1}{n} \frac{\cos\left(\frac{4\pi}{M} - s\right)}{\cos\left(\frac{2\pi}{M}\right)} - \frac{n-1}{2n} \frac{M}{2\pi} \cos\left(\frac{2\pi}{M}\right) \left( \sin\left(\frac{4\pi}{M} - s\right) - \left(\frac{4\pi}{M} - s\right) \cos\left(\frac{4\pi}{M} - s\right) \right) = 0 \tag{119}$$

which has the same form as the circular case with the replacements $M \to \frac{M}{2}$
and $n \to \frac{n+1}{2}$. $D_1$ and $D_2$ may be written out in full as



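$$\begin{aligned}
D_1 = \frac{M}{2n\pi} \Bigg( &\int_0^{-\frac{2\pi}{M} + s} d\theta \left( f_1(\theta) \left( 1 + \left\| \mathbf{n}(\theta) - r\,\mathbf{n}(0) \right\|^2 \right) + f_3\left(\tfrac{4\pi}{M} - \theta\right) \left( 1 + \left\| \mathbf{n}(\theta) - r\,\mathbf{n}\left(\tfrac{4\pi}{M}\right) \right\|^2 \right) + f_3\left(\tfrac{4\pi}{M} + \theta\right) \left( 1 + \left\| \mathbf{n}(\theta) - r\,\mathbf{n}\left(-\tfrac{4\pi}{M}\right) \right\|^2 \right) \right) \\
+ &\int_{-\frac{2\pi}{M} + s}^{\frac{6\pi}{M} - s} d\theta \left( f_2(\theta) \left( 1 + \left\| \mathbf{n}(\theta) - r\,\mathbf{n}(0) \right\|^2 \right) + f_2\left(\tfrac{4\pi}{M} - \theta\right) \left( 1 + \left\| \mathbf{n}(\theta) - r\,\mathbf{n}\left(\tfrac{4\pi}{M}\right) \right\|^2 \right) \right) \\
+ &\int_{\frac{6\pi}{M} - s}^{\frac{4\pi}{M}} d\theta \left( f_3(\theta) \left( 1 + \left\| \mathbf{n}(\theta) - r\,\mathbf{n}(0) \right\|^2 \right) + f_1\left(\tfrac{4\pi}{M} - \theta\right) \left( 1 + \left\| \mathbf{n}(\theta) - r\,\mathbf{n}\left(\tfrac{4\pi}{M}\right) \right\|^2 \right) + f_3\left(\tfrac{8\pi}{M} - \theta\right) \left( 1 + \left\| \mathbf{n}(\theta) - r\,\mathbf{n}\left(\tfrac{8\pi}{M}\right) \right\|^2 \right) \right) \Bigg)
\end{aligned} \tag{120}$$

$$\begin{aligned}
D_2 = \frac{(n-1)M}{n\pi} \Bigg( &\int_0^{-\frac{2\pi}{M} + s} d\theta \left\| \mathbf{n}(\theta) - \tfrac{1}{2} f_1(\theta)\, r\,\mathbf{n}(0) - \tfrac{1}{2} f_3\left(\tfrac{4\pi}{M} - \theta\right) r\,\mathbf{n}\left(\tfrac{4\pi}{M}\right) - \tfrac{1}{2} f_3\left(\tfrac{4\pi}{M} + \theta\right) r\,\mathbf{n}\left(-\tfrac{4\pi}{M}\right) \right\|^2 \\
+ &\int_{-\frac{2\pi}{M} + s}^{\frac{6\pi}{M} - s} d\theta \left\| \mathbf{n}(\theta) - \tfrac{1}{2} f_2(\theta)\, r\,\mathbf{n}(0) - \tfrac{1}{2} f_2\left(\tfrac{4\pi}{M} - \theta\right) r\,\mathbf{n}\left(\tfrac{4\pi}{M}\right) \right\|^2 \\
+ &\int_{\frac{6\pi}{M} - s}^{\frac{4\pi}{M}} d\theta \left\| \mathbf{n}(\theta) - \tfrac{1}{2} f_3(\theta)\, r\,\mathbf{n}(0) - \tfrac{1}{2} f_1\left(\tfrac{4\pi}{M} - \theta\right) r\,\mathbf{n}\left(\tfrac{4\pi}{M}\right) - \tfrac{1}{2} f_3\left(\tfrac{8\pi}{M} - \theta\right) r\,\mathbf{n}\left(\tfrac{8\pi}{M}\right) \right\|^2 \Bigg)
\end{aligned} \tag{121}$$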

The optimum $f_i(\theta)$ and $r$ may be substituted into $D_1 + D_2$, the integrations
evaluated, and then the condition that the optimum $s$ must satisfy may be used
to simplify the result, to yield the minimum $D_1 + D_2$ as

$$D_1 + D_2 = \frac{(n-1) \left( 2(n-2) - n \frac{M}{2\pi} s \right) - 2n \sec^2\left(\frac{2\pi}{M}\right)}{(n-1)^2} - \frac{n \left( (n-1) \left( 2 - \frac{M}{2\pi} s \right) + 2 \sec^2\left(\frac{2\pi}{M}\right) \right)}{(n-1)^2} \cos\left(\frac{8\pi}{M} - 2s\right) \tag{122}$$